spark-user mailing list archives

From jagaximo <>
Subject How to compute RDD[(String, Set[String])] that include large Set
Date Tue, 20 Jan 2015 03:38:10 GMT
I want to compute an RDD[(String, Set[String])] where some of the Sets are very large.

val hoge: RDD[(String, Set[String])] = ...
val reduced = hoge.reduceByKey(_ ++ _) // builds very large Sets (shuffle read ~7 GB)
val counted = { case (key, strs) => s"$key\t${strs.size}" }

Looking at the Spark UI, in the saveAsTextFile stage executors are lost and tasks are resubmitted; after that, Spark keeps losing executors.

I thought one way to solve this would be to build an RDD[(String, RDD[String])], union the inner RDD[String]s, and take a distinct count. But when I create an RDD inside another RDD, a NullPointerException occurs, so maybe this operation is impossible.

What might be the issue and possible solution? 

Please lend me your wisdom.
