Try using reduceByKeyLocally.
Lukas Nalezenec
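
For readers unfamiliar with it, reduceByKeyLocally merges the values for each key inside every partition and then combines the per-partition maps on the driver, returning an ordinary map instead of an RDD. The sketch below imitates those semantics in plain Python, without Spark. The helper names and the sample data are hypothetical, and it is only an illustration of the idea, not Spark's implementation. Because the whole result ends up on the driver, this approach probably only suits jobs where the number of distinct keys is small enough to fit in driver memory.

```python
from functools import reduce


def reduce_partition(records, func):
    """Merge values per key within a single partition."""
    merged = {}
    for key, value in records:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def merge_maps(left, right, func):
    """Combine two per-partition maps, as the driver would."""
    out = dict(left)
    for key, value in right.items():
        out[key] = func(out[key], value) if key in out else value
    return out


def reduce_by_key_locally(partitions, func):
    """Plain-Python stand-in for the semantics of reduceByKeyLocally."""
    per_partition = [reduce_partition(p, func) for p in partitions]
    return reduce(lambda a, b: merge_maps(a, b, func), per_partition, {})


# Hypothetical data: two partitions of (key, count) pairs.
partitions = [[("a", 1), ("b", 2), ("a", 3)], [("b", 4), ("c", 5)]]
result = reduce_by_key_locally(partitions, lambda x, y: x + y)

# With PySpark available, the corresponding call would be roughly:
#   result = rdd.reduceByKeyLocally(lambda x, y: x + y)
```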

On Sun, May 18, 2014 at 3:33 AM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
Make sure you set up enough reduce partitions so that no single one is overloaded. It may also help to check whether you’ve run out of local disk space on the machines, and to turn on spark.shuffle.consolidateFiles so the shuffle produces fewer files. Finally, a recent fix in both branch-0.9 and master reduces the amount of memory used when there are small files (the extra memory was being taken by mmap()): https://issues.apache.org/jira/browse/SPARK-1145. You can find it in either the 1.0 release candidates on the dev list or branch-0.9 in git.
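
As a rough sketch of the first two suggestions: one way to pick a reduce partition count is to divide the expected shuffle volume by a target size per partition. The helper name, the 128 MiB target, and the 1 TiB job size below are assumptions for illustration, not recommendations from this thread, so tune them against your own cluster. The trailing comments show how the result and the consolidateFiles setting might be passed to PySpark. That setting belongs to the Spark versions discussed here and may not exist in later releases.

```python
import math


def suggest_reduce_partitions(total_shuffle_bytes,
                              target_bytes_per_partition=128 * 1024 ** 2,
                              min_partitions=1):
    """Hypothetical heuristic: one reduce partition per target-sized chunk."""
    wanted = math.ceil(total_shuffle_bytes / target_bytes_per_partition)
    return max(min_partitions, wanted)


# Assumed example job: about 1 TiB of shuffle data.
num_partitions = suggest_reduce_partitions(1024 ** 4)

# Setting named in the message above, kept as a plain dict here.
shuffle_settings = {"spark.shuffle.consolidateFiles": "true"}

# With PySpark available, these would be applied roughly as:
#   conf = SparkConf().set("spark.shuffle.consolidateFiles", "true")
#   sc = SparkContext(conf=conf)
#   counts = pairs.reduceByKey(lambda x, y: x + y, numPartitions=num_partitions)
```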


On May 17, 2014, at 5:45 PM, Madhu <madhu@madhu.com> wrote:

> Daniel,
> How many partitions do you have?
> Are they more or less uniformly distributed?
> We have similar data volume currently running well on Hadoop MapReduce with
> roughly 30 nodes.
> I was planning to test it with Spark.
> I'm very interested in your findings.
> -----
> Madhu
> https://www.linkedin.com/in/msiddalingaiah