As a follow-up on this, the memory footprint of all shuffle metadata has been greatly reduced. For your original workload with 7k mappers, 7k reducers, and 5 machines, the total metadata size should have decreased from ~3.3 GB to ~80 MB.
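For a rough sense of scale, here's a back-of-envelope sketch (it assumes the metadata is dominated by one entry per mapper-reducer pair, which is only an approximation of the real bookkeeping):

    // Back-of-envelope only: assume one metadata entry per (mapper, reducer)
    // pair and divide the observed totals by that count.
    val mappers  = 7000L
    val reducers = 7000L
    val blocks   = mappers * reducers              // 49,000,000 shuffle blocks

    val beforeBytes = 3.3 * 1024 * 1024 * 1024     // ~3.3 GB
    val afterBytes  = 80.0 * 1024 * 1024           // ~80 MB

    println(f"before: ${beforeBytes / blocks}%.1f bytes per block")  // ~72 bytes
    println(f"after:  ${afterBytes / blocks}%.1f bytes per block")   // ~1.7 bytes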


On Tue, Oct 29, 2013 at 9:07 AM, Aaron Davidson <ilikerps@gmail.com> wrote:
Great! Glad to hear it worked out. Spark definitely has a pain point around choosing the right number of partitions, and I think we're going to be spending a lot of time trying to alleviate that.

I'm currently working on the patch to reduce the shuffle file block overheads, but in the meantime you can set "spark.shuffle.consolidateFiles=false" to trade the OOMEs caused by too many partitions for somewhat worse performance (probably an acceptable tradeoff).
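For reference, a minimal sketch of how one might set that from the driver, assuming the usual system-property-based configuration (the master URL and app name are placeholders; adjust to however you launch your jobs):

    // Sketch: set the property before the SparkContext is created.
    // "spark://master:7077" and "report-job" are placeholders.
    import org.apache.spark.SparkContext

    System.setProperty("spark.shuffle.consolidateFiles", "false")
    val sc = new SparkContext("spark://master:7077", "report-job")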



On Mon, Oct 28, 2013 at 2:31 PM, Stephen Haberman <stephen.haberman@gmail.com> wrote:
Hey guys,

As a follow-up, I raised our target partition size to 600 MB (up from
64 MB), which split this report's 500 GB of tiny S3 files into ~700
partitions, and everything ran much more smoothly.
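For anyone hitting the same thing, here's a rough sketch of the sizing we ended
up doing; the input path and variable names are just illustrative, and the exact
partition count depends on the real file sizes:

    // Sketch only: derive the partition count from a target partition size
    // instead of guessing at coalesce. Paths and names are placeholders.
    import org.apache.spark.SparkContext

    val sc  = new SparkContext("local[4]", "partition-sizing-sketch")
    val rdd = sc.textFile("s3n://some-bucket/reports/*")     // placeholder input

    val totalSizeBytes      = 500L * 1024 * 1024 * 1024      // ~500 GB of input
    val targetPartitionSize = 600L * 1024 * 1024             // ~600 MB per partition
    val numPartitions       = math.max(1, (totalSizeBytes / targetPartitionSize).toInt)

    val sized = rdd.coalesce(numPartitions)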

In retrospect, this was the same issue we'd run into before, having too
many partitions, which we had previously solved by throwing some guesses
at coalesce to make it magically go away.

But now I feel like we have a much better understanding of why the
numbers need to be what they are, which is great.

So, thanks for all the input and helping me understand what's going on.

It'd be great to see some of the optimizations to BlockManager happen,
but I understand in the end why it needs to track what it does. And
admittedly, I was using a fairly small cluster anyway.

- Stephen