spark-user mailing list archives

From Dongjin Lee <dong...@apache.org>
Subject Re: What is correct behavior for spark.task.maxFailures?
Date Mon, 24 Apr 2017 15:02:46 GMT
Sumit,

I think the post below describes exactly your case.

https://blog.cloudera.com/blog/2017/04/blacklisting-in-apache-spark/
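
In case it helps, here is a minimal sketch of how the relevant settings might look, assuming Spark 2.1+ (where the blacklisting feature and its spark.blacklist.* settings are available). The application name is just a placeholder, and the exact thresholds are illustrative, not recommendations:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch, assuming Spark 2.1+ where blacklisting exists.
    // The master URL is assumed to be supplied via spark-submit.
    val spark = SparkSession.builder()
      .appName("blacklist-example")  // placeholder name
      // Total attempts allowed per task before the whole job fails.
      .config("spark.task.maxFailures", "8")
      // Blacklisting is off by default; enabling it keeps retries
      // away from executors/nodes that already failed this task.
      .config("spark.blacklist.enabled", "true")
      // After this many failed attempts of a task on one executor,
      // that executor is blacklisted for that task.
      .config("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")
      // After this many failed attempts of a task on one node,
      // the whole node is blacklisted for that task.
      .config("spark.blacklist.task.maxTaskAttemptsPerNode", "2")
      .getOrCreate()

With spark.task.maxFailures set to 8 but blacklisting disabled, nothing prevents the scheduler from placing every retry back on the same bad slave, which matches the behavior you observed.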

Regards,
Dongjin

--
Dongjin Lee

Software developer at Line+.
Interested in massive-scale machine learning.

facebook: http://www.facebook.com/dongjin.lee.kr
linkedin: http://kr.linkedin.com/in/dongjinleekr
github: http://github.com/dongjinleekr
twitter: http://www.twitter.com/dongjinleekr

On 22 Apr 2017, 5:32 AM +0900, Chawla,Sumit <sumitkchawla@gmail.com>, wrote:
> I am seeing a strange issue. A badly behaving slave failed the entire job. I have set
> spark.task.maxFailures to 8 for my job. It seems that all task retries happen on the same
> slave when a failure occurs. My expectation was that the task would be retried on a
> different slave after a failure, and that the chance of all 8 retries landing on the same
> slave would be very low.
>
>
> Regards
> Sumit Chawla
>
