spark-issues mailing list archives

From "Harry Brundage (JIRA)" <>
Subject [jira] [Commented] (SPARK-4732) All application progress on the standalone scheduler can be halted by one systematically faulty node
Date Tue, 16 Dec 2014 15:23:13 GMT


Harry Brundage commented on SPARK-4732:

Seems like it would, feel free to mark as duplicate!

> All application progress on the standalone scheduler can be halted by one systematically faulty node
> ----------------------------------------------------------------------------------------------------
>                 Key: SPARK-4732
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.0, 1.2.0
>         Environment:  - Spark Standalone scheduler
>            Reporter: Harry Brundage
> We've experienced several cluster-wide outages caused by unexpected system-wide faults
on one of our Spark workers when that worker is failing systematically. By systematically, I
mean that every executor launched by that worker will definitely fail due to some reason out
of Spark's control like the log directory disk being completely out of space, or a permissions
error for a file that's always read during executor launch. We screw up all the time on our
team and cause stuff like this to happen, but because of the way the standalone scheduler
allocates resources, our cluster doesn't recover gracefully from these failures. 
> When there are more tasks to do than executors, I am pretty sure the way the scheduler
works is that it just waits for more resource offers and then allocates tasks from the queue
to those resources. If an executor dies immediately after starting, the worker monitor process
will notice that it's dead. The master will allocate that worker's now free cores/memory to
a currently running application that is below its spark.cores.max, which in our case I've
observed as usually the app that just had the executor die. A new executor gets spawned on
the same worker that the last one just died on, gets allocated that one task that failed,
and then the whole process fails again for the same systematic reason, and lather rinse repeat.
This happens 10 times or whatever the max task failure count is, and then the whole app is
deemed a failure by the driver and shut down completely.
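
The relaunch loop described above can be modelled with a short sketch. This is a toy simulation of the behaviour as reported, not Spark's scheduler code: the function name, the worker names, and the offer order are all hypothetical, and the limit of 10 is simply the figure quoted in the report.

```python
def run_until_placed(free_workers, broken, max_failures=10):
    """Toy model of the relaunch loop described in the report.

    free_workers: worker names with free cores, in the order they are offered.
    broken: set of worker names on which every executor launch fails.
    max_failures: assumed failure limit (the report mentions 10).
    """
    failures = 0
    attempts = []
    while failures < max_failures:
        # Assumption: the cores freed by the dead executor are offered
        # straight back to the same application, so the same worker is picked.
        worker = free_workers[0]
        attempts.append(worker)
        if worker not in broken:
            return "RUNNING", attempts
        failures += 1
    # Every attempt landed on the broken worker; the app is given up on.
    return "FAILED", attempts


if __name__ == "__main__":
    status, attempts = run_until_placed(["worker-3"], broken={"worker-3"})
    print(status, len(attempts))
```

Because nothing in this model remembers where the previous executor died, the only way out of the loop is the failure limit, which matches the outage pattern in the report.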
> This happens to us for all applications in the cluster as well. We usually run roughly
as many cores as we have Hadoop nodes. We also usually have many more input splits than we
have tasks, which means the locality of the first few tasks, which I believe determines where
our executors run, is well spread out over the cluster and often covers 90-100% of nodes.
This means the likelihood of any application getting an executor scheduled on any broken node
is quite high. After an old application goes through the above-mentioned process and dies,
the next application to start, or the next one below its requested max capacity, gets an executor scheduled
on the broken node and is promptly taken down as well. This happens over and over,
to the point where none of our Spark jobs are making any progress because of one tiny permissions
mistake on one node.
> Now, I totally understand this is usually an "error between keyboard and screen" kind
of situation where it is the responsibility of the people deploying Spark to ensure it is
deployed correctly. The systematic issues we've encountered are almost always of this nature:
permissions errors, disk full errors, one node not getting a new Spark jar because of a configuration
error, configurations being out of sync, etc. That said, disks are going to fail or half fail,
fill up, node rot is going to ruin configurations, and so on, and as Hadoop clusters scale
in size this becomes more and more likely, so I think it's reasonable to ask that Spark be
resilient to this kind of failure and keep on truckin'.
> I think a good simple fix would be to have applications, or the master, blacklist workers
(not executors) at a failure count lower than the task failure count. This would also serve
as a belt and suspenders fix for SPARK-4498.
>  If the scheduler stopped trying to schedule on nodes that fail a lot, we could still
make progress. These blacklist events are really important and I think would need to be well
logged and surfaced in the UI, but I'd rather log and carry on than fail hard. I think the
tradeoff here is that you risk blacklisting every worker as well if there is something systematically
wrong with communication or whatever else I can't imagine.
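
A minimal sketch of the proposed worker blacklist, including a guard against the "blacklist every worker" risk mentioned above. Everything here is hypothetical and for illustration only: the class name, the threshold of 3 failures, and the cap on the blacklisted fraction are assumptions, not existing Spark settings or APIs.

```python
import logging
from collections import Counter

log = logging.getLogger("blacklist_sketch")


class WorkerBlacklist:
    """Counts executor failures per worker and excludes repeat offenders."""

    def __init__(self, all_workers, max_worker_failures=3,
                 max_blacklisted_fraction=0.5):
        self.all_workers = set(all_workers)
        # Assumed to be lower than the limit that kills the whole application.
        self.max_worker_failures = max_worker_failures
        # Never blacklist more than this many workers, so a cluster-wide
        # problem cannot remove every worker from scheduling.
        self.max_blacklisted = int(len(self.all_workers) * max_blacklisted_fraction)
        self.failures = Counter()
        self.blacklisted = set()

    def record_failure(self, worker):
        """Register one executor failure on the given worker."""
        self.failures[worker] += 1
        if worker in self.blacklisted:
            return
        if self.failures[worker] < self.max_worker_failures:
            return
        if len(self.blacklisted) >= self.max_blacklisted:
            log.warning("not blacklisting %s: cap of %d workers reached",
                        worker, self.max_blacklisted)
            return
        self.blacklisted.add(worker)
        log.warning("blacklisting %s after %d executor failures",
                    worker, self.failures[worker])

    def eligible(self):
        """Workers the scheduler may still place executors on."""
        return sorted(self.all_workers - self.blacklisted)


if __name__ == "__main__":
    bl = WorkerBlacklist(["w1", "w2", "w3", "w4"])
    for _ in range(3):
        bl.record_failure("w2")
    print(bl.eligible())
```

Logging each blacklist decision at warning level follows the point above that these events need to be visible; surfacing them in a UI is outside this sketch.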
> Please let me know if I've misunderstood how the scheduler works or you need more information
or anything like that and I'll be happy to provide. 

This message was sent by Atlassian JIRA
