spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ryan Blue <>
Subject Re: What is correct behavior for spark.task.maxFailures?
Date Mon, 24 Apr 2017 16:48:05 GMT
Looking at the code a bit more, it appears that blacklisting is disabled by
default. To enable it, set spark.blacklist.enabled=true.

The updates in 2.1.0 appear to provide much more fine-grained settings for
this, like the number of tasks that can fail before an executor is
blacklisted for a stage. In that version, you probably want to set
spark.blacklist.task.maxTaskAttemptsPerExecutor. See the settings docs
<> and search for
“blacklist” to see all the options.


On Mon, Apr 24, 2017 at 9:41 AM, Ryan Blue <> wrote:

> Chawla,
> We hit this issue, too. I worked around it by setting spark.scheduler.
> executorTaskBlacklistTime=5000. The problem for us was that the scheduler
> was using locality to select the executor, even though it had already
> failed there. The executor task blacklist time controls how long the
> scheduler will avoid using an executor for a failed task, which will cause
> it to avoid rescheduling on the executor. The default was 0, so the
> executor was put back into consideration immediately.
> In 2.1.0 that setting has changed to spark.blacklist.timeout. I’m not
> sure if that does exactly the same thing. The default for that setting is
> 1h instead of 0. It’s better to have a non-zero default to avoid what
> you’re seeing.
> rb
> ​
> On Fri, Apr 21, 2017 at 1:32 PM, Chawla,Sumit <>
> wrote:
>> I am seeing a strange issue. I had a bad behaving slave that failed the
>> entire job.  I have set spark.task.maxFailures to 8 for my job.  Seems
>> like all task retries happen on the same slave in case of failure.  My
>> expectation was that task will be retried on different slave in case of
>> failure, and chance of all 8 retries to happen on same slave is very less.
>> Regards
>> Sumit Chawla
> --
> Ryan Blue
> Software Engineer
> Netflix

Ryan Blue
Software Engineer

View raw message