spark-issues mailing list archives

From "haiyangyu (Jira)" <>
Subject [jira] [Comment Edited] (SPARK-30297) Executor heartbeat expired cause app hung up forever
Date Wed, 18 Dec 2019 14:54:00 GMT


haiyangyu edited comment on SPARK-30297 at 12/18/19 2:53 PM:

[] [~AMateenM]

[~dongjoon]  [~yanboliang]

Please take a look at this patch, thanks!

was (Author: yuhaiyang):
[] [~AMateenM]


Please take a look at this patch, thanks!

> Executor heartbeat expired cause app hung up forever
> ----------------------------------------------------
>                 Key: SPARK-30297
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0, 2.4.4
>            Reporter: haiyangyu
>            Priority: Major
> h3. *Background*
> If an executor is lost in the network and sends neither an RST nor a close packet to the driver,
> the driver cannot detect the loss from the connection being dropped. It can only detect that
> the executor is dead when its heartbeat expires.
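The heartbeat-based detection described above can be sketched roughly as follows. This is a toy Python model, not Spark's HeartbeatReceiver code: the class name `HeartbeatTracker`, its methods, the executor ids, and the 120-second timeout are all assumptions made for illustration.

```python
class HeartbeatTracker:
    """Toy driver-side heartbeat table (hypothetical, not Spark's API)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # executor id -> time of the last heartbeat

    def record_heartbeat(self, executor_id, now):
        self.last_seen[executor_id] = now

    def expired(self, now):
        """Return the executors whose last heartbeat is older than the timeout."""
        return sorted(
            e for e, t in self.last_seen.items() if now - t > self.timeout_s
        )


tracker = HeartbeatTracker(timeout_s=120.0)  # assumed timeout value
tracker.record_heartbeat("exec-1", now=0.0)
tracker.record_heartbeat("exec-2", now=0.0)
# exec-1 keeps reporting; exec-2 silently drops off the network (no RST or close).
tracker.record_heartbeat("exec-1", now=100.0)
print(tracker.expired(now=130.0))  # ['exec-2']
```

Because the lost executor never closes its connection, the missing heartbeat is the only signal the driver has in this model.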
> h3. *Problems*
> The heartbeat expiration is processed as follows:
>  # The executor's heartbeat expires, as described above.
>  # HeartbeatReceiver notifies the scheduler that the executor is lost so that the tasks on this executor are rescheduled.
>  # HeartbeatReceiver kills the executor.
> The tasks from the dead executor may be rescheduled onto the same dead executor. If the tasks
> are rescheduled before the executor has been removed from executorBackend, the driver sends
> the launch-task message to this executor again. The executor does not respond, and the driver
> cannot detect the failure through the heartbeat because the executor has already been lost in
> the network. As a result, the tasks rescheduled onto the lost executor never finish, and the
> app hangs forever.
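The ordering problem described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not Spark code: `ToyDriver`, `stuck_tasks`, `simulate`, and the executor and task ids are hypothetical, and the scheduler here always picks the first registered executor so that the race is deterministic (in a real cluster the tasks only have a chance of landing on the dead executor). The `remove_first=True` branch shows one possible ordering that avoids the hang in this model; it is not necessarily what the patch does.

```python
class ToyDriver:
    """Toy model of the driver's bookkeeping (hypothetical, not Spark's API)."""

    def __init__(self, executors):
        self.registered = set(executors)  # executors the backend still offers
        self.running = {}                 # task -> executor it was launched on
        self.pending = []                 # tasks waiting to be scheduled

    def launch_pending(self):
        # The scheduler only sees registration, not network reachability.
        while self.pending and self.registered:
            task = self.pending.pop(0)
            # Assumption: deterministic pick so the race always shows up.
            self.running[task] = min(self.registered)

    def on_heartbeat_expired(self, executor, remove_first):
        if remove_first:
            self.registered.discard(executor)
        # Step 2: mark the executor's tasks as lost and reschedule them.
        lost = [t for t, e in self.running.items() if e == executor]
        for task in lost:
            del self.running[task]
            self.pending.append(task)
        self.launch_pending()
        # Step 3: the executor is killed and removed only afterwards.
        self.registered.discard(executor)


def stuck_tasks(driver, unreachable):
    """Tasks that were launched on executors that can never answer."""
    return sorted(t for t, e in driver.running.items() if e in unreachable)


def simulate(remove_first):
    driver = ToyDriver(["exec-1", "exec-2"])
    driver.pending.append("task-0")
    driver.launch_pending()  # task-0 lands on exec-1
    # exec-1 drops off the network; its heartbeat eventually expires.
    driver.on_heartbeat_expired("exec-1", remove_first)
    return stuck_tasks(driver, unreachable={"exec-1"})


print(simulate(remove_first=False))  # ['task-0']
print(simulate(remove_first=True))   # []
```

In the first run the task is relaunched on `exec-1` before it is removed, and nothing in the model ever reports it as failed again, which mirrors the hang described in the issue.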

This message was sent by Atlassian Jira
