spark-user mailing list archives

From 瀬川 卓也 <>
Subject Data skewed to a specific partition after sort
Date Thu, 29 Jan 2015 02:57:04 GMT
For example, consider a word count over large text data (on the order of 100 GB). The word distribution is clearly skewed and is expected to have a long tail: the most frequent word probably accounts for more than 1/10 of all occurrences.

Word count code:

import org.apache.hadoop.io.compress.GzipCodec

val allWordLineSplited: RDD[String] = // create RDD …
val wordRDD = allWordLineSplited.flatMap(_.split(" "))
val count = => (word, 1)).reduceByKey(_ + _)
val sort = { case (word, cnt) => (cnt, word) }.sortByKey()

sort.saveAsTextFile("/path/to/save/dir/", classOf[GzipCodec])

The problem seems to lie in the sort: in this case, all the data for word number 1 gathers in a single partition, so the shuffle read for that partition becomes very large. The other executors finish early and sit idle waiting, and eventually executors are lost…
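To illustrate the skew described above, here is a minimal plain-Scala sketch (no Spark; the data and partition count are made up): any partitioner that keeps equal keys in the same partition must put every occurrence of the hot key into a single partition, so that partition dwarfs the rest.

```scala
val numPartitions = 4
// hypothetical skewed data: one dominant word plus a long tail
val words: Seq[String] = Seq.fill(1000)("the") ++ Seq.tabulate(500)(i => s"word$i")

// hash partitioning, as a stand-in for any keep-equal-keys-together scheme
def partitionOf(key: String): Int = math.floorMod(key.hashCode, numPartitions)

// size of each partition: the one holding "the" is far larger than the others
val partitionSizes: Map[Int, Int] =
  words.groupBy(partitionOf).map { case (p, ws) => (p, ws.size) }
println(partitionSizes)
```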

If possible, I would like to equalize the partition sizes after the sort.
If no such technique exists, what would be a better way to compute this?
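For reference, one common mitigation for hot keys (my own sketch, not from the post) is two-stage aggregation with key salting: append a random salt to each key so the hot key's partial counts spread across several reducers, then strip the salt and merge the partials. The sketch below uses plain Scala collections; in Spark the two `groupBy`/`map` stages would be two `reduceByKey` calls.

```scala
import scala.util.Random

val saltBuckets = 8
// hypothetical skewed input: one dominant word plus a long tail
val words: Seq[String] = Seq.fill(1000)("the") ++ Seq.tabulate(100)(i => s"w$i")

// stage 1: count (word, salt) pairs; the hot word is split into up to 8 sub-keys,
// so its partial counts can be computed on different partitions
val partial: Map[(String, Int), Int] =
  words.map(w => ((w, Random.nextInt(saltBuckets)), 1))
       .groupBy(_._1)
       .map { case (k, vs) => (k, vs.size) }

// stage 2: drop the salt and merge the partial counts per word
val total: Map[String, Int] =
  partial.toSeq
         .map { case ((w, _), c) => (w, c) }
         .groupBy(_._1)
         .map { case (w, pairs) => (w, pairs.map(_._2).sum) }
```

The second stage only sees at most `saltBuckets` records per word, so it is cheap even for the hot key.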
