Hi!

I tried this (by setting spark.task.maxFailures to 1) and it still does not fail fast. I started a job and, after some time, killed all the JVMs running on one of the two workers. I expected the Spark job to fail; however, it re-scheduled the tasks on the worker that was still alive and the job succeeded.
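
For reference, this is roughly how I ran the test (a minimal sketch, assuming Spark 0.8-style configuration via Java system properties; the master URL, app name and workload are just placeholders):

// Sketch only: the property name comes from the docs quoted below; everything
// else (master URL, app name, workload) is a stand-in for the real job.
import org.apache.spark.SparkContext

object MaxFailuresTest {
  def main(args: Array[String]): Unit = {
    // With maxFailures = 1 the docs promise 0 retries, i.e. fail-fast.
    System.setProperty("spark.task.maxFailures", "1")

    val sc = new SparkContext("spark://master:7077", "max-failures-test")
    // 200 tasks of ~5 seconds each: long enough to kill the executor JVMs
    // on one of the two workers while the job is running.
    sc.parallelize(1 to 200, 200).map { i => Thread.sleep(5000); i }.count()
    sc.stop()
  }
}

If fail-fast worked as documented, killing the JVMs on one worker mid-run should abort this job rather than let it finish on the remaining worker.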

Grega
--
Grega Kešpret
Analytics engineer

Celtra — Rich Media Mobile Advertising
celtra.com | @celtramobile


On Mon, Dec 9, 2013 at 10:43 AM, Grega Kešpret <grega@celtra.com> wrote:
Hi Reynold,

I submitted a pull request here - https://github.com/apache/incubator-spark/pull/245
Do I need to do anything else (perhaps add a ticket in JIRA)?

Best,
Grega
--
Grega Kešpret
Analytics engineer

Celtra — Rich Media Mobile Advertising
celtra.com | @celtramobile


On Fri, Nov 29, 2013 at 6:24 PM, Reynold Xin <rxin@apache.org> wrote:
Looks like a bug to me. Can you submit a pull request?



On Fri, Nov 29, 2013 at 2:02 AM, Grega Kešpret <grega@celtra.com> wrote:

> Looking at
> http://spark.incubator.apache.org/docs/latest/configuration.html
> the docs say:
> Number of individual task failures before giving up on the job. Should be
> greater than or equal to 1. Number of allowed retries = this value - 1.
>
> However, looking at the code
>
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala#L532
>
> if I set spark.task.maxFailures to 1, the job only fails after a task fails
> for the second time. Shouldn't this line be corrected to
> if (numFailures(index) >= MAX_TASK_FAILURES) {
> ? (See the simplified sketch at the bottom of this thread.)
>
> I can open a pull request if this is the case.
>
> Thanks,
> Grega
> --
> Grega Kešpret
> Analytics engineer
>
> Celtra — Rich Media Mobile Advertising
> celtra.com | @celtramobile
>
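
For anyone following along, here is a simplified, self-contained sketch of the off-by-one described above. It is not the actual ClusterTaskSetManager code; the names and the "abort" behaviour are stand-ins, only the comparison matters:

object OffByOneSketch {
  // Stand-in for spark.task.maxFailures (the documented default is 4).
  val MAX_TASK_FAILURES = 1

  // Per-task failure count, standing in for the scheduler's bookkeeping.
  val numFailures = scala.collection.mutable.Map[Int, Int]().withDefaultValue(0)

  def handleFailedTask(index: Int): Unit = {
    numFailures(index) += 1
    // Current check: if (numFailures(index) > MAX_TASK_FAILURES) { abort }
    // With MAX_TASK_FAILURES = 1 that aborts only on the *second* failure.
    // Proposed fix, matching "allowed retries = this value - 1" from the docs:
    if (numFailures(index) >= MAX_TASK_FAILURES) {
      println("Task " + index + " failed " + numFailures(index) +
        " time(s); aborting the TaskSet")
    } else {
      println("Task " + index + " failed; re-queueing it for another attempt")
    }
  }

  def main(args: Array[String]): Unit = {
    handleFailedTask(0) // with ">=" this aborts on the first failure, as documented
  }
}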