spark-user mailing list archives

From "Kevin (Sangwoo) Kim" <>
Subject Re: How to compute RDD[(String, Set[String])] that include large Set
Date Tue, 20 Jan 2015 05:17:47 GMT
In your code, you are combining large in-memory sets, for example
(set1 ++ set2).size
which is not a good idea, because each combined set has to fit in memory.

(rdd1 ++ rdd2).distinct
is an equivalent implementation and is computed in a distributed manner.
I am not sure whether your computation on keyed sets can feasibly be
transformed into RDD operations.
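
To illustrate the idea, here is a minimal sketch. It assumes the data can be
kept as flat (key, value) pairs instead of RDD[(String, Set[String])], and
that only the sizes are needed. The object name, the helper names and the
sample data are hypothetical and are not taken from your code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark 1.x; harmless on later versions
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object DistinctCountSketch {

  // Distributed counterpart of (set1 ++ set2).size:
  // union the two RDDs, drop duplicates, count the remainder.
  def unionSize(rdd1: RDD[String], rdd2: RDD[String]): Long =
    (rdd1 ++ rdd2).distinct().count()

  // Number of distinct values per key, computed from flat (key, value)
  // pairs so that no Set[String] is ever built for a key.
  def distinctCountPerKey(pairs: RDD[(String, String)]): RDD[(String, Long)] =
    pairs
      .distinct()                      // remove duplicate (key, value) pairs
      .map { case (k, _) => (k, 1L) }  // one count per remaining pair
      .reduceByKey(_ + _)              // sum the counts for each key

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for trying the sketch on one machine.
    val conf = new SparkConf().setAppName("distinct-count-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val a = sc.parallelize(Seq("x", "y"))
      val b = sc.parallelize(Seq("y", "z"))
      println(unionSize(a, b))

      val pairs = sc.parallelize(Seq(("k1", "x"), ("k1", "y"), ("k1", "x"), ("k2", "z")))
      distinctCountPerKey(pairs).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Whether this fits depends on what you do with the sets afterwards. If you
need the set contents per key and not just their sizes, this sketch does not
apply directly.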


On Tue Jan 20 2015 at 1:57:52 PM Kevin Jung <> wrote:

> As far as I know, the steps before the saveAsTextFile call are
> transformations, so they are computed lazily. The saveAsTextFile action
> then runs all of the transformations, and that is when your Set[String]
> grows. If you have only a few keys, each set becomes a large collection,
> and this easily causes an OOM when your executor memory and memory fraction
> settings are not suited to the computation.
> If you only want counts per key, you can use countByKey(), or map the
> RDD[(String, Set[String])] to an RDD[(String, Long)] after creating the
> hoge RDD, so that reduceByKey combines only counts.