spark-user mailing list archives

From Kevin Jung
Subject Re: How to compute RDD[(String, Set[String])] that include large Set
Date Tue, 20 Jan 2015 04:57:03 GMT
As far as I know, everything before the saveAsTextFile() call is a
transformation, so it is computed lazily. The saveAsTextFile() action is what
actually runs all of those transformations, and that is when your Set[String]
grows. If you have only a few keys, each key's set becomes a very large
collection, and this easily causes an OOM when your executor memory and
memory fraction settings are not suited to the computation.
If you only need the per-key counts, you can use countByKey(), or map() the
RDD[(String, Set[String])] to an RDD[(String, Long)] after creating the hoge
RDD, so that reduceByKey aggregates only counts per key.
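
A minimal sketch of the counting approaches mentioned above, assuming the
input is an RDD of (key, value) string pairs. The object name, the sample
data, and the local master setting are hypothetical and only there to make
the example self-contained; the distinct-based variant is one possible way
to count unique values per key without building a Set[String] per key, not
necessarily what the original job needs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; names and sample data are illustrative only.
object CountsPerKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("counts-per-key").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Assumed input shape: (key, value) pairs with few distinct keys.
    val pairs = sc.parallelize(Seq(("a", "x"), ("a", "y"), ("a", "x"), ("b", "z")))

    // Option 1: countByKey() is an action that returns the number of
    // records per key to the driver as a Map.
    val recordCounts: scala.collection.Map[String, Long] = pairs.countByKey()
    recordCounts.foreach(println)

    // Option 2: map every record to (key, 1L) and sum with reduceByKey,
    // so only a Long per key is shuffled instead of a growing Set[String].
    val countsRdd = pairs.map { case (k, _) => (k, 1L) }.reduceByKey(_ + _)
    countsRdd.collect().foreach(println)

    // Variant (an assumption about the goal): count unique values per key
    // by de-duplicating the pairs first, then counting as in option 2.
    val distinctCounts = pairs.distinct()
      .map { case (k, _) => (k, 1L) }
      .reduceByKey(_ + _)
    distinctCounts.collect().foreach(println)

    sc.stop()
  }
}
```

Nothing runs until countByKey() or collect() is called; the map(),
distinct() and reduceByKey() calls only describe the computation.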
