spark-issues mailing list archives

From "Josh Rosen (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-3289) Prevent complete job failures due to rescheduling of failing tasks on buggy machines
Date Thu, 28 Aug 2014 22:21:09 GMT
Josh Rosen created SPARK-3289:
---------------------------------

             Summary: Prevent complete job failures due to rescheduling of failing tasks on buggy machines
                 Key: SPARK-3289
                 URL: https://issues.apache.org/jira/browse/SPARK-3289
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
            Reporter: Josh Rosen


Some users have reported issues where a task fails due to an environment or configuration problem
on some machine, and the task is then reattempted _on that same buggy machine_ until the entire
job fails because that single task has failed too many times.

To guard against this, maybe we should add some randomization to how we reschedule failed
tasks, so that a retry is less likely to land on the same machine where the task just failed.
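A minimal Python sketch of the idea, assuming a scheduler that knows the candidate hosts for a task and the set of hosts where that task has already failed. The helper names (`pick_host_for_retry`, `run_task`) and the host names are hypothetical, not Spark APIs; `max_failures` stands in for a per-task retry limit such as Spark's `spark.task.maxFailures`. A retry goes to a random host where the task has not failed yet, and only falls back to a random pick among all hosts when every candidate has already failed.

```python
import random
from typing import List, Optional, Set


def pick_host_for_retry(
    candidate_hosts: List[str],
    failed_hosts: Set[str],
    rng: random.Random,
) -> Optional[str]:
    """Choose a host for re-running a failed task (hypothetical helper).

    Hosts where the task has not failed yet are preferred and picked
    uniformly at random, so a retry is not pinned to one machine. If the
    task has failed on every candidate, pick randomly among all of them.
    """
    if not candidate_hosts:
        return None
    fresh = [h for h in candidate_hosts if h not in failed_hosts]
    pool = fresh if fresh else candidate_hosts
    return rng.choice(pool)


def run_task(
    hosts: List[str],
    broken_hosts: Set[str],
    max_failures: int,
    randomized: bool,
    seed: int = 0,
) -> bool:
    """Simulate retries; return True if the task succeeds within max_failures attempts."""
    rng = random.Random(seed)
    failed_on: Set[str] = set()
    for _ in range(max_failures):
        if randomized:
            host = pick_host_for_retry(hosts, failed_on, rng)
        else:
            # Naive policy: always reuse the same preferred host.
            host = hosts[0]
        if host not in broken_hosts:
            return True  # the attempt ran on a healthy machine
        failed_on.add(host)  # remember where this task failed
    return False  # too many failures: the whole job would be aborted


hosts = ["bad-node", "node-1", "node-2"]
print(run_task(hosts, {"bad-node"}, max_failures=4, randomized=False))  # False
print(run_task(hosts, {"bad-node"}, max_failures=4, randomized=True))   # True
```

With the broken host listed first, the naive loop spends all four attempts on it and the job fails, while the randomized policy moves to a healthy host after at most one failure there. Excluding hosts that already failed, rather than randomizing alone, is what guarantees this in the sketch; pure randomization would only make repeated bad picks unlikely.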



--
This message was sent by Atlassian JIRA
(v6.2#6252)

