Thanks for the input. I will give foldByKey a shot.
The way I am doing it is: the data is partitioned hourly, so I compute
distinct values hourly. Then I use unionRDD to merge them and compute
distinct on the overall data.
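That hourly pipeline might be sketched as follows (a sketch only: `sc`, the
hypothetical `hourlyRdds` list, and the String pair types are my assumptions,
not code from this thread):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Distinct within each hour first, then union the hourly results
// and run distinct once more across the merged data.
// `sc` and `hourlyRdds` are assumed to already exist.
def hourlyDistinct(sc: SparkContext,
                   hourlyRdds: Seq[RDD[(String, String)]]): RDD[(String, String)] = {
  val perHour = hourlyRdds.map(_.distinct())  // dedupe inside each hourly partition
  sc.union(perHour).distinct()                // dedupe across all hours
}
```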
> Is there a way to know which (key, value) pair is resulting in the OOM?
> Is there a way to set parallelism in the map stage so that each worker
will process one key at a time?
I didn't realise countApproxDistinctByKey uses HyperLogLogPlus. This
should be interesting.
Vivek
On Sat, Jun 14, 2014 at 11:37 PM, Sean Owen <sowen@cloudera.com> wrote:
> Grouping by key is always problematic since a key might have a huge number
> of values. You can do a little better than grouping *all* values and *then*
> finding distinct values by using foldByKey, putting values into a Set. At
> least you end up with only distinct values in memory. (You don't need two
> maps either, right?)
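Sean's foldByKey-into-a-Set idea could be sketched like this (assuming an
existing `rdd: RDD[(K, V)]`; the function name is mine, not from the thread):

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Fold each key's values into a Set so only the distinct values are
// ever held in memory for a key, then take the set size as the count.
def distinctCountPerKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): RDD[(K, Int)] =
  rdd.mapValues(v => Set(v))            // wrap each value in a singleton Set
     .foldByKey(Set.empty[V])(_ ++ _)   // union the sets per key
     .mapValues(_.size)                 // distinct count per key
```

This still materialises one Set per key, so it only helps when the distinct
values fit in memory even if the raw values do not.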
>
> If the number of distinct values is still huge for some keys, consider the
> experimental method countApproxDistinctByKey:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L285
>
> This should be much more performant at the cost of some accuracy.
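Usage is essentially a one-liner (again assuming an existing
`rdd: RDD[(K, V)]`; `relativeSD` controls the accuracy of the underlying
HyperLogLog++ sketch):

```scala
// Approximate distinct-value count per key; lower relativeSD means
// better accuracy at the cost of more memory per counter.
val approxCounts = rdd.countApproxDistinctByKey(relativeSD = 0.05)
// approxCounts has type RDD[(K, Long)]
```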
>
>
> On Sat, Jun 14, 2014 at 1:58 PM, Vivek YS <vivek.ys@gmail.com> wrote:
>
>> Hi,
>> For last couple of days I have been trying hard to get around this
>> problem. Please share any insights on solving this problem.
>>
>> Problem :
>> There is a huge list of (key, value) pairs. I want to transform this to
>> (key, distinct values) and then eventually to (key, distinct values count)
>>
>> On small dataset
>>
>> groupByKey().map(x => (x._1, x._2.distinct)) ...map(x => (x._1,
>> x._2.distinct.size))
>>
>> On a large data set I am getting an OOM.
>>
>> Is there a way to represent Seq of values from groupByKey as RDD and then
>> perform distinct over it ?
>>
>> Thanks
>> Vivek
>>
>
>
