spark-dev mailing list archives

From Madhu <>
Subject Detecting configuration problems
Date Sun, 06 Sep 2015 15:23:54 GMT
I'm not sure if this has been discussed already; if so, please point me to
the thread and/or related JIRA.

I have been running with about a 1 TB volume on a 20-node D2 cluster (255
The data is uniformly distributed, so skew is not a problem.

I found that the default (or a wrong) setting for driver and executor
memory caused out-of-memory exceptions during shuffle (subtractByKey, to be
exact). This was not easy to track down, for me at least.

Once I bumped the driver up to 12G and executors to 10G, with 300 executors
and 3000 partitions, the shuffle worked quite well (12 minutes for
subtractByKey). I'm sure there are more improvements to be made, but it's a
lot better than heap-space exceptions!
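For reference, the settings above translate to roughly the following submit
invocation (a sketch from my runs, not a universal recommendation; the
application jar and class are placeholders):

```shell
# Memory and parallelism settings that made subtractByKey complete for me.
# Values are specific to my 20-node D2 cluster and ~1 TB input.
spark-submit \
  --driver-memory 12g \
  --executor-memory 10g \
  --num-executors 300 \
  --conf spark.default.parallelism=3000 \
  --class com.example.MyJob \
  my-job.jar
```

The partition count can also be set per-operation, e.g. by passing
numPartitions to subtractByKey, instead of via spark.default.parallelism.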

From my reading, the shuffle OOM problem is in ExternalAppendOnlyMap or a
similar disk-backed collection.
I have some familiarity with that code based on previous work with external

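For context, in the 1.x configuration the in-memory portion of that
collection is bounded by a fraction of the executor heap before it spills to
disk. A sketch of the knobs involved (1.x-era property names; shown for
orientation, not as a tuning recommendation):

```shell
# Spill behavior of ExternalAppendOnlyMap in Spark 1.x is governed by
# these settings; the defaults shown are the documented 1.x defaults.
spark-submit \
  --conf spark.shuffle.spill=true \
  --conf spark.shuffle.memoryFraction=0.2 \
  --class com.example.MyJob \
  my-job.jar
```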
Is it possible to detect misconfigurations that lead to these OOMs and
produce more meaningful error messages? I think that would really help
users who might not understand all the inner workings and configuration of
Spark (myself included). As it is, heap-space issues are a challenge and do
not present Spark in a positive light.

I can help with that effort if someone is willing to point me to the precise
location of memory pressure during shuffle.


