spark-user mailing list archives

From Vivek YS <>
Subject Re: GroupByKey results in OOM - Any other alternative
Date Sun, 15 Jun 2014 04:08:48 GMT
Thanks for the input. I will give foldByKey a shot.

The way I am doing it: the data is partitioned hourly, so I compute
distinct values per hour. Then I use unionRDD to merge the hourly results
and compute distinct on the overall data.
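Roughly, the per-hour pipeline looks like this (a sketch only; `sc`,
`loadHourlyPartitions` and the element types are placeholders, not my
actual code):

```scala
// Sketch, assuming a SparkContext `sc` and a placeholder loader that
// returns one RDD of (key, value) pairs per hourly partition.
val hourlyRdds: Seq[RDD[(String, String)]] = loadHourlyPartitions()

// Distinct within each hour first, then union and distinct over the whole range.
val hourlyDistinct = hourlyRdds.map(_.distinct())
val overallDistinct = sc.union(hourlyDistinct).distinct()
```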

> Is there a way to know which (key, value) pair is causing the OOM?
> Is there a way to set parallelism in the map stage so that each worker
> will process one key at a time?

I didn't realise countApproxDistinctByKey uses HyperLogLog++. This
should be interesting.
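For reference, a sketch of both alternatives (assuming an
RDD[(String, String)] called `pairs`; untested, names are placeholders):

```scala
// foldByKey route: fold each key's values into a Set, so only the
// distinct values per key are held in memory, then take the set size.
val distinctCounts: RDD[(String, Int)] =
  pairs.mapValues(Set(_))
       .foldByKey(Set.empty[String])(_ ++ _)
       .mapValues(_.size)

// Approximate route: countApproxDistinctByKey uses HyperLogLog++
// internally; the argument is the relative standard deviation of the
// estimate, trading accuracy for memory.
val approxCounts: RDD[(String, Long)] = pairs.countApproxDistinctByKey(0.05)
```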


On Sat, Jun 14, 2014 at 11:37 PM, Sean Owen <> wrote:

> Grouping by key is always problematic since a key might have a huge number
> of values. You can do a little better than grouping *all* values and *then*
> finding distinct values by using foldByKey, putting values into a Set. At
> least you end up with only distinct values in memory. (You don't need two
> maps either, right?)
> If the number of distinct values is still huge for some keys, consider the
> experimental method countApproxDistinctByKey.
> This should be much more performant at the cost of some accuracy.
> On Sat, Jun 14, 2014 at 1:58 PM, Vivek YS <> wrote:
>> Hi,
>>    For last couple of days I have been trying hard to get around this
>> problem. Please share any insights on solving this problem.
>> Problem :
>> There is a huge list of (key, value) pairs. I want to transform this to
>> (key, distinct values) and then eventually to (key, distinct values count)
>> On a small dataset this works:
>> groupByKey().map(x => (x._1, x._2.distinct)).map(x => (x._1, x._2.size))
>> On the large data set I am getting an OOM.
>> Is there a way to represent the Seq of values from groupByKey as an RDD
>> and then perform distinct over it?
>> Thanks
>> Vivek
