spark-user mailing list archives

From Denny Lee <denny.g....@gmail.com>
Subject Re: Spark Shell slowness on Google Cloud
Date Thu, 18 Dec 2014 06:46:06 GMT
Oh, it makes sense that gsutil scans through this quickly, but I was
wondering whether running a Hadoop job / bdutil would result in equally
fast scans?

On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta <alexbaretta@gmail.com>
wrote:

> Denny,
>
> No, gsutil scans through the listing of the bucket quickly. See the
> following.
>
> alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"
>
> 6860
>
> real    0m6.971s
> user    0m1.052s
> sys     0m0.096s
>
> Alex
>
>
> On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee <denny.g.lee@gmail.com> wrote:
>>
>> I'm curious if you're seeing the same thing when using bdutil against
>> GCS?  I'm wondering if this may be an issue concerning the transfer rate of
>> Spark -> Hadoop -> GCS Connector -> GCS.
>>
>>
>> On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <
>> alexbaretta@gmail.com> wrote:
>>
>>> All,
>>>
>>> I'm using the Spark shell to interact with a small test deployment of
>>> Spark, built from the current master branch. I'm processing a dataset
>>> comprising a few thousand objects on Google Cloud Storage, split into a
>>> half dozen directories. My code constructs an object--let me call it the
>>> Dataset object--that defines a distinct RDD for each directory. The
>>> constructor of the object only defines the RDDs; it does not actually
>>> evaluate them, so I would expect it to return very quickly. Indeed, the
>>> logging code in the constructor prints a line signaling the completion of
>>> the code almost immediately after invocation, but the Spark shell does not
>>> show the prompt right away. Instead, it spends a few minutes seemingly
>>> frozen, eventually producing the following output:
>>>
>>> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process : 9
>>>
>>> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process : 759
>>>
>>> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process : 228
>>>
>>> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process : 3076
>>>
>>> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process : 1013
>>>
>>> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process : 156
>>>
>>> This stage is inexplicably slow. What could be happening?
>>>
>>> Thanks.
>>>
>>>
>>> Alex
>>>
>>
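
One way to read the FileInputFormat log lines quoted above is to divide the gap between consecutive timestamps by the number of paths reported. The sketch below does that arithmetic in Python. It assumes, and this is only an assumption, that each gap is the time spent listing the directory whose count is printed at the end of that gap; the first line (9 paths) has no preceding timestamp and is skipped. The function name seconds_per_path is made up for illustration.

```python
from datetime import datetime

# (timestamp, path count) pairs copied from the FileInputFormat log lines in the thread
LOG = [
    ("14/12/18 05:52:49", 9),
    ("14/12/18 05:54:15", 759),
    ("14/12/18 05:54:40", 228),
    ("14/12/18 06:00:11", 3076),
    ("14/12/18 06:02:02", 1013),
    ("14/12/18 06:02:21", 156),
]

def seconds_per_path(log):
    """Return (paths, gap_seconds, seconds_per_path) for every line after the first.

    Assumption: the gap before a log line is the time taken to list the
    directory that the line reports on.
    """
    fmt = "%y/%m/%d %H:%M:%S"  # log4j-style yy/MM/dd timestamps
    times = [datetime.strptime(ts, fmt) for ts, _ in log]
    rows = []
    for prev, cur, (_, paths) in zip(times, times[1:], log[1:]):
        gap = (cur - prev).total_seconds()
        rows.append((paths, gap, gap / paths))
    return rows

for paths, gap, rate in seconds_per_path(LOG):
    print(f"{paths:5d} paths  {gap:5.0f} s  {rate:.3f} s/path")
```

Under that assumption every directory works out to roughly 0.1 s per path, whereas the gsutil listing in the reply covers 6860 entries in about 7 s. A roughly constant per-path cost would be consistent with one sequential request per object, but the log alone does not confirm that.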
