spark-user mailing list archives

From Vivek YS <vivek...@gmail.com>
Subject GroupByKey results in OOM - Any other alternative
Date Sat, 14 Jun 2014 17:58:58 GMT
Hi,
   For the last couple of days I have been trying hard to get around this
problem. Please share any insights on solving it.

Problem :
There is a huge list of (key, value) pairs. I want to transform it into
(key, distinct values) and then eventually into (key, distinct values count).
On a small dataset this works:

groupByKey().map(x => (x._1, x._2.distinct)) ...map(x => (x._1,
x._2.distinct.size))

On a large dataset I am getting an OOM (out-of-memory) error.

Is there a way to represent the Seq of values from groupByKey as an RDD and
then perform distinct over it?

Thanks
Vivek
