spark-user mailing list archives

From jagaximo <takuya_seg...@dwango.co.jp>
Subject How to compute RDD[(String, Set[String])] that include large Set
Date Tue, 20 Jan 2015 03:38:10 GMT
I want to compute an RDD[(String, Set[String])] in which some of the
Set[String] values are very large.

--------------
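import org.apache.spark.rdd.RDD

val hoge: RDD[(String, Set[String])] = ...
// reduceByKey merges all Sets per key, creating very large Sets (shuffle read size 7GB)
val reduced = hoge.reduceByKey(_ ++ _)
// Emit one tab-separated line per key: key, number of distinct strings
val counted = reduced.map { case (key, strSet) => s"$key\t${strSet.size}" }
counted.saveAsTextFile("/path/to/save/dir")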
----------

Looking at the Spark UI, in the stage that saves the output, an executor is
lost and the stage is resubmitted. After that, Spark keeps losing executors.

One approach I considered is to build an 'RDD[(String, RDD[String])]',
union the inner RDD[String]s, and do a distinct count. But creating an RDD
inside an RDD throws a NullPointerException, so this is probably not possible.
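
For reference, the "union and distinct count" idea can be sketched without
nesting RDDs. This is only a sketch under assumptions: it assumes the
individual input Sets are small and only the merged Sets are large, and that
only the per-key count is needed, not the merged Set itself. The helper name
distinctCountPerKey is hypothetical. It flattens each Set into (key, value)
pairs, removes duplicate pairs, and then counts the pairs per key, so no
single record ever has to hold a whole merged Set.

```scala
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed on Spark 1.2 and earlier)
import org.apache.spark.rdd.RDD

// Hypothetical helper: counts distinct values per key without merging the Sets.
def distinctCountPerKey(hoge: RDD[(String, Set[String])]): RDD[(String, Long)] =
  hoge
    .flatMap { case (key, values) => values.map(v => (key, v)) } // one record per (key, value)
    .distinct()                                                  // drop duplicate (key, value) pairs
    .mapValues(_ => 1L)                                          // each surviving pair counts once
    .reduceByKey(_ + _)                                          // sum the counts per key
```

Possible usage, mirroring the original output format:
distinctCountPerKey(hoge).map { case (k, n) => s"$k\t$n" }.saveAsTextFile("/path/to/save/dir").
If an approximate count is acceptable, PairRDDFunctions also offers
countApproxDistinctByKey, which may be cheaper on the flattened pairs.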

What might be the issue, and what is a possible solution?

Please lend me your wisdom.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compute-RDD-String-Set-String-that-include-large-Set-tp21248.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


