spark-issues mailing list archives

From "Jiang Xingbo (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-24387) Heartbeat-timeout executor is added back and used again
Date Mon, 11 Jun 2018 22:05:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16508832#comment-16508832 ]

Jiang Xingbo commented on SPARK-24387:
--------------------------------------

{quote}So I think there's a race condition that the backend may make offers before killing
the executor. And since this is the only executor left, it's offered to the TaskScheduler
and the retried task is scheduled to it.{quote}
IIUC, removing an executor due to a heartbeat timeout is treated as a SlaveLost, which should
result in a taskFailure for each task running on that executor and therefore blacklist the
task from running on that executor again. So why can the executor be offered to the retried
task again?
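
The expectation behind this question can be sketched as a toy model. This is only an illustration of the reasoning above, not Spark's scheduler code: the class `ToyScheduler` and its methods are hypothetical names, and the model assumes that losing an executor records a per-task blacklist entry for it. Whether Spark actually applies such a rule in the reported scenario is exactly what is being asked here.

```python
# Toy model only: all names are hypothetical and do not correspond to Spark's API.
from collections import defaultdict


class ToyScheduler:
    def __init__(self):
        self.live_executors = set()
        # task id -> executors on which that task has already failed
        self.blacklist = defaultdict(set)
        # task id -> executor id the task is currently running on
        self.running = {}

    def add_executor(self, executor_id):
        self.live_executors.add(executor_id)

    def launch(self, task_id, executor_id):
        self.running[task_id] = executor_id

    def executor_lost(self, executor_id):
        """Remove the executor and fail every task that was running on it."""
        self.live_executors.discard(executor_id)
        for task_id, exec_id in list(self.running.items()):
            if exec_id == executor_id:
                del self.running[task_id]
                # Assumption: a task failure blacklists the executor for that task.
                self.blacklist[task_id].add(executor_id)

    def resource_offer(self, task_id, executor_id):
        """Return True if the task may be scheduled on the offered executor."""
        if executor_id not in self.live_executors:
            return False
        return executor_id not in self.blacklist[task_id]


sched = ToyScheduler()
sched.add_executor("exec-1")
sched.launch("task-0", "exec-1")
sched.executor_lost("exec-1")   # heartbeat timeout: executor removed, task failed
sched.add_executor("exec-1")    # the same executor is added back
print(sched.resource_offer("task-0", "exec-1"))  # prints False under this assumption
```

Under this assumption the retried task would be refused on the re-added executor, which is why the reported behaviour looks surprising.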

> Heartbeat-timeout executor is added back and used again
> -------------------------------------------------------
>
>                 Key: SPARK-24387
>                 URL: https://issues.apache.org/jira/browse/SPARK-24387
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: Rui Li
>            Priority: Major
>
> In our job, when there's only one task and one executor running, the executor's heartbeat
> is lost and the driver decides to remove it. However, the executor is added again and the
> task's retry attempt is scheduled to that executor, almost immediately after the executor
> is marked as lost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

