spark-user mailing list archives

From Yana Kadiyska <yana.kadiy...@gmail.com>
Subject Help alleviating OOM errors
Date Mon, 30 Jun 2014 14:39:50 GMT
Hi,

Our cluster seems to have a really hard time with OOM errors on the
executors. Periodically we'd see a task that gets sent to a few
executors, one would OOM, and then the job just stays active for hours
(sometimes 30+, whereas normally it completes in under a minute).

So I have a few questions:

1. Why am I seeing OOMs to begin with?

I'm running with defaults for
spark.storage.memoryFraction
spark.shuffle.memoryFraction

so my understanding is that once Spark exceeds 60% of available memory,
data will be spilled to disk. Am I misunderstanding this? In the
attached screenshot, I see a single stage with 2 tasks on the same
executor -- no disk spills, but an OOM.
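
To make sure I'm reading the docs right, here is the arithmetic I have in
mind -- a sketch only, using a hypothetical 8 GB executor heap rather than
our real setting:

    // Sketch: hypothetical 8 GB executor heap with the default fractions
    val heapGb = 8.0
    val storageCapGb = heapGb * 0.6  // spark.storage.memoryFraction caps cached blocks at ~4.8 GB
    val shuffleCapGb = heapGb * 0.3  // spark.shuffle.memoryFraction (0.3, if I recall the default
                                     // correctly) caps shuffle aggregation buffers before spilling

i.e. my expectation was that shuffle aggregation would spill once it hit
that cap rather than throw an OOM.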

2. How can I reduce the likelihood of seeing OOMs? I am a bit concerned
that I don't see a spill at all, so I'm not sure whether decreasing
spark.storage.memoryFraction is what needs to be done.
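
If lowering it is the right move, I assume it would look roughly like the
following -- the 0.4 value and the 8g heap are placeholders, not what we
actually run:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of the tuning I'm considering; values are placeholders
    val conf = new SparkConf()
      .setAppName("oom-debug")                     // hypothetical app name
      .set("spark.executor.memory", "8g")          // hypothetical executor heap
      .set("spark.storage.memoryFraction", "0.4")  // below the 0.6 default: more heap for tasks
      .set("spark.shuffle.memoryFraction", "0.3")
    val sc = new SparkContext(conf)

Though as I understand it, lowering the storage fraction only helps if the
OOM is coming from task execution rather than from cached blocks, which is
part of what I'm unsure about.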

3. Why does an OOM seem to break the executor so hopelessly? I am
seeing times upwards of 30 hours once an OOM occurs. Why is that? The
task *should* take under a minute, so even if the whole RDD were
recomputed from scratch, 30 hours is very mysterious to me. Hadoop can
process this in about 10-15 minutes, so I imagine that even if the whole
job went to disk it should still not take more than an hour.

Any insight into this would be much appreciated.
Running Spark 0.9.1
