spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Debasish Das <debasish.da...@gmail.com>
Subject Re: flatMap followed by mapPartitions
Date Fri, 14 Nov 2014 17:32:33 GMT
mapPartitions tried to hold data is memory which did not work for me..

I am doing flatMap followed by groupByKey now with HashPartitioner and
number of blocks is 60 (Based on 120 cores I am running the job on)...

Now when the shuffle size < 100 GB it works fine...as flatMap shuffle goes
to 200 GB, 400 GB...I am getting:

FetchFailed(BlockManagerId(1, istgbd013.verizon.com, 44377, 0),
shuffleId=37, mapId=8, reduceId=54)

I have to shuffle because the memory on cluster is less than the shuffle
size of 400 GB..

The job runs fine if I sample and decrease my shuffle size within 100 GB..

Does groupByKey does a combiner similar to reduceByKey and aggregateByKey ?
I need a combiner operation to do some work on map side after flatMap
followed by rest of the work on reducers..

On Wed, Nov 12, 2014 at 8:35 PM, Mayur Rustagi <mayur.rustagi@gmail.com>
wrote:

> flatmap would have to shuffle data only if output RDD is expected to be
> partitioned by some key.
> RDD[X].flatmap(X=>RDD[Y])
> If it has to shuffle it should be local.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
> On Thu, Nov 13, 2014 at 7:31 AM, Debasish Das <debasish.das83@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am doing a flatMap followed by mapPartitions to do some blocked
>> operation...flatMap is shuffling data but this shuffle is strictly
>> shuffling to disk and not over the network right ?
>>
>> Thanks.
>> Deb
>>
>
>

Mime
View raw message