spark-user mailing list archives

From Soumya Simanta <soumya.sima...@gmail.com>
Subject Re: Fwd: Is there a way to load a large file from HDFS faster into Spark
Date Sun, 11 May 2014 13:17:07 GMT
Yep. I figured that out. I uncompressed the file and it loads much faster
now. Thanks.
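
(A minimal sketch of that decompress-and-reload route, in Scala for the
spark-shell; the HDFS paths below are hypothetical, not the original
poster's:)

// The .gz is not splittable, so this RDD has exactly one partition.
val gz = sc.textFile("hdfs:///data/input.gz")   // hypothetical path

// Write an uncompressed copy once; plain text *is* splittable on re-read.
gz.saveAsTextFile("hdfs:///data/input-plain")

// Subsequent loads parallelize across HDFS blocks as usual.
val lines = sc.textFile("hdfs:///data/input-plain")
println(lines.partitions.size)   // roughly one partition per HDFS block
lines.cache()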



On Sun, May 11, 2014 at 8:14 AM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:

> .gz files are not splittable, and hence harder to process. The easiest fix is
> to move to a splittable compression format like LZO and break the file into
> multiple blocks for reading and subsequent processing.
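
(When re-encoding isn't convenient, a common workaround is to repartition
immediately after the single-partition read so downstream stages use all
cores. A minimal Scala sketch; the path and the partition count of 48 are
assumptions based on the cluster described below:)

// One task reads the whole .gz (not splittable); the shuffle then spreads
// the records over 48 partitions for every later stage.
val compressed = sc.textFile("hdfs:///data/input.gz")  // hypothetical path
val spread = compressed.repartition(48)
spread.cache()            // cache the repartitioned RDD, not the 1-partition one
println(spread.count())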
> On 11 May 2014 09:01, "Soumya Simanta" <soumya.simanta@gmail.com> wrote:
>
>>
>>
>> I've a Spark cluster with 3 worker nodes.
>>
>>
>>    - *Workers:* 3
>>    - *Cores:* 48 Total, 48 Used
>>    - *Memory:* 469.8 GB Total, 72.0 GB Used
>>
>> I want to process a single compressed file (*.gz) on HDFS. The file is
>> 1.5GB compressed and 11GB uncompressed.
>> When I try to read the compressed file from HDFS it takes a while (4-5
>> minutes) to load it into an RDD. If I use the .cache operation it takes even
>> longer. Is there a way to make loading of the RDD from HDFS faster?
>>
>> Thanks
>>  -Soumya
>>
>>
>>
