spark-user mailing list archives

From Matei Zaharia <>
Subject Re: Configuring Spark for reduceByKey on massive data sets
Date Sun, 18 May 2014 01:33:49 GMT
Make sure you set up enough reduce partitions so you don’t overload them. Another thing that
may help is checking whether you’ve run out of local disk space on the machines, and turning
on spark.shuffle.consolidateFiles to produce fewer files. Finally, there’s been a recent
fix in both branch-0.9 and master that reduces the amount of memory used when there are small
files (due to extra memory that was being taken by mmap()). You can find this fix in either
the 1.0 release candidates on the dev list, or in branch-0.9 in git.
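The two settings Matei names can be sketched as configuration. This assumes the spark-defaults.conf format introduced with the 1.0 release candidates; under 0.9 the same properties can be passed as -D JVM options via SPARK_JAVA_OPTS. The parallelism value is illustrative, not a recommendation from this thread.

```properties
# Default number of reduce partitions (value is illustrative; size it
# so each reducer's shuffle input fits comfortably in memory)
spark.default.parallelism        400

# Merge shuffle map outputs into fewer files on local disk
spark.shuffle.consolidateFiles   true
```

A per-job alternative is to pass the partition count directly to the operation itself, e.g. rdd.reduceByKey(_ + _, 400) in Scala.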


On May 17, 2014, at 5:45 PM, Madhu <> wrote:

> Daniel,
> How many partitions do you have?
> Are they more or less uniformly distributed?
> We have a similar data volume currently running well on Hadoop MapReduce with
> roughly 30 nodes.
> I was planning to test it with Spark. 
> I'm very interested in your findings. 
> -----
> Madhu
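Madhu's two questions (how many partitions, and whether they are uniformly distributed) amount to a key-skew check. A minimal sketch of that check in plain Scala, with no Spark dependency; in a real job the per-key counts would come from something like countByKey on the RDD, and all names here are hypothetical:

```scala
// Sketch of a key-skew check: count records per key, then compare the
// largest key's count to the mean count. A ratio near 1.0 suggests a
// uniform distribution; a large ratio means one reducer will be overloaded.
object SkewCheck {
  // Count occurrences of each key in a local sample of keys.
  def keyCounts[K](keys: Seq[K]): Map[K, Int] =
    keys.groupBy(identity).map { case (k, ks) => (k, ks.size) }

  // Ratio of the heaviest key's count to the mean per-key count.
  def skewRatio[K](counts: Map[K, Int]): Double = {
    val sizes = counts.values
    sizes.max.toDouble / (sizes.sum.toDouble / sizes.size)
  }

  def main(args: Array[String]): Unit = {
    val keys   = Seq("a", "a", "a", "a", "b", "c")
    val counts = keyCounts(keys)
    println(skewRatio(counts)) // prints 2.0: key "a" is 2x the mean
  }
}
```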
