spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Harry Brundage (JIRA)" <>
Subject [jira] [Created] (SPARK-4732) All application progress on the standalone scheduler can be halted by one systematically faulty node
Date Thu, 04 Dec 2014 00:04:12 GMT
Harry Brundage created SPARK-4732:

             Summary: All application progress on the standalone scheduler can be halted by
one systematically faulty node
                 Key: SPARK-4732
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.1.0, 1.2.0
         Environment:  - Spark Standalone scheduler
            Reporter: Harry Brundage

We've experienced several cluster wide outages caused by unexpected system wide faults on
one of our spark workers if that worker is failing systematically. By systematically, I mean
that every executor launched by that worker will definitely fail due to some reason out of
Spark's control like the log directory disk being completely out of space, or a permissions
error for a file that's always read during executor launch. We screw up all the time on our
team and cause stuff like this to happen, but because of the way the standalone scheduler
allocates resources, our cluster doesn't recover gracefully from these failures. 

Correct me if I am wrong but when there are more tasks to do than executors, I am pretty sure
the way the scheduler works is that it just waits for more resource offers and then allocates
tasks from the queue to those resources. If an executor dies immediately after starting, the
worker monitor process will notice that it's dead. The master will allocate that worker's
now free cores/memory to a currently running application that is below its spark.cores.max,
which in our case I've observed as usually the app that just had the executor die. A new executor
gets spawned on the same worker that the last one just died on, gets allocated that one task
that failed, and then the whole process fails again for the same systematic reason, and lather
rinse repeat. This happens 10 times or whatever the max task failure count is, and then the
whole app is deemed a failure by the driver and shut down completely.

For us, we usually run roughly as many cores as we have hadoop nodes. We also usually have
many more input splits than we have tasks, which means the locality of the first few tasks
which I believe determines where our executors run is well spread out over the cluster, and
often covers 90-100% of nodes. This means the likelihood of any application getting an executor
scheduled any broken node is quite high. So, in my experience, after an application goes through
the above mentioned process and dies, the next application to start or not be at it's requested
max capacity gets an executor scheduled on the broken node, and is promptly taken down as
well. This happens over and over as well, to the point where none of our spark jobs are making
any progress because of one tiny permissions mistake on one node.

Now, I totally understand this is usually an "error between keyboard and screen" kind of situation
where it is the responsibility of the people deploying spark to ensure it is deployed correctly.
The systematic issues we've encountered are almost always of this nature: permissions errors,
disk full errors, one node not getting a new spark jar from a configuration error, configurations
being out of sync, etc. That said, disks are going to fail or half fail, fill up, node rot
is going to ruin configurations, etc etc etc, and as hadoop clusters scale in size this becomes
more and more likely, so I think its reasonable to ask that Spark be resilient to this kind
of failure and keep on truckin'. 

I think a good simple fix would be to have applications, or the master, blacklist workers
(not executors) at a failure count lower than the task failure count. This would also serve
as a belt and suspenders fix for SPARK-4498.
 If the scheduler stopped trying to schedule on nodes that fail a lot, we could still make
progress. These blacklist events are really important and I think would need to be well logged
and surfaced in the UI, but I'd rather log and carry on than fail hard. I think the tradeoff
here is that you risk blacklisting ever worker as well if there is something systematically
wrong with communication or whatever else I can't imagine.

Please let me know if I've misunderstood how the scheduler works or you need more information
or anything like that and I'll be happy to provide. 

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message