spark-issues mailing list archives

From "Thomas Graves (JIRA)" <>
Subject [jira] [Commented] (SPARK-24909) Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts
Date Wed, 01 Aug 2018 15:45:00 GMT


Thomas Graves commented on SPARK-24909:

Looking more, I think the fix may actually just be to revert the change from SPARK-19263, so that it always does shuffleStage.pendingPartitions -= task.partitionId.  The change in SPARK-23433 should fix the issue originally reported in SPARK-19263.

If we always remove the partition from pendingPartitions and the map output isn't there, the scheduler will resubmit the stage.  SPARK-23433, since it marks all tasks in other stage attempts as complete, should make sure no other active attempts for that stage are running.

Need to investigate more and run some tests.
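As a rough sketch of the bookkeeping described above (hypothetical names and a deliberately simplified model, not the actual DAGScheduler code): always removing a finished task's partition from pendingPartitions, and relying on a missing-map-output check to resubmit the stage, would look roughly like this.

```scala
import scala.collection.mutable

object PendingPartitionsSketch {
  // Simplified stand-in for a ShuffleMapStage: which partitions still have a
  // running task, and which executor holds each partition's map output.
  final case class ShuffleStage(
      pendingPartitions: mutable.HashSet[Int],
      mapOutputs: mutable.HashMap[Int, String]) // partitionId -> executorId

  // On task completion: always clear the pending partition (the reverted
  // behavior proposed above) and record where the map output lives.
  def onTaskCompleted(stage: ShuffleStage, partitionId: Int, executorId: String): Unit = {
    stage.pendingPartitions -= partitionId
    stage.mapOutputs(partitionId) = executorId
  }

  // On executor loss: drop that executor's map outputs. The partitions are no
  // longer pending, but their outputs are missing, so the stage must be
  // resubmitted rather than left hanging.
  def onExecutorLost(stage: ShuffleStage, executorId: String): Unit =
    stage.mapOutputs.filterInPlace((_, exec) => exec != executorId)

  // The stage has nothing running but incomplete output: resubmit it.
  def needsResubmit(stage: ShuffleStage, numPartitions: Int): Boolean =
    stage.pendingPartitions.isEmpty && stage.mapOutputs.size < numPartitions
}
```

The point of the sketch is only that unconditional removal keeps pendingPartitions consistent with the task sets, and the missing-map-output check then does the resubmission that the conditional removal was trying to handle.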


> Spark scheduler can hang when fetch failures, executor lost, task running on lost executor,
and multiple stage attempts
> -----------------------------------------------------------------------------------------------------------------------
>                 Key: SPARK-24909
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.3.1
>            Reporter: Thomas Graves
>            Priority: Critical
> The DAGScheduler can hang if the executor was lost (due to fetch failure) and all the
tasks in the task sets are marked as completed.
> It never creates new task attempts in the task scheduler, but the DAGScheduler still
has pendingPartitions.
> {code:java}
> 18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in stage 44.0 (TID
970752,, executor 33, partition 55769, PROCESS_LOCAL, 7874 bytes)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 (repartition
at Lift.scala:191) as failed due to a fetch failure from ShuffleMapStage 42 (map at foo.scala:27)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 42 (map at
foo.scala:27) and ShuffleMapStage 44 (repartition at bar.scala:191) due to fetch failure
> ....
> 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
> 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 33 (epoch 18)
> 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 (MapPartitionsRDD[70] at
repartition at bar.scala:191), which has no missing parents
> 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 with 59955 tasks
> 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in stage 44.0
(TID 970752) in 101505 ms on (executor 33) (15081/73320)
> 18/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus ShuffleMapTask(44,
55769) completion from executor 33{code}
> In the logs above you can see that task 55769.0 finished after the executor was lost
and a new task set was started.  The DAGScheduler says "Ignoring possibly bogus...", but on
the TaskSetManager side those tasks have been marked as completed for all stage attempts, so
attempt 44.1 never reruns them.  The DAGScheduler gets hung here.  I did a heap dump on the
process and can see that 55769 is still in the DAGScheduler pendingPartitions set but the
TaskSetManagers are all complete
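
The hang condition in the report can be modeled with a tiny sketch (hypothetical names, not Spark code): the TaskSetManager marks the partition finished for every attempt, while the DAGScheduler drops the "possibly bogus" completion from the lost executor, so pendingPartitions never drains and no new task is ever launched for it.

```scala
import scala.collection.mutable

object SchedulerHangSketch {
  // Returns (pending partitions, whether the task set thinks it is complete).
  def simulate(): (Set[Int], Boolean) = {
    val pendingPartitions = mutable.HashSet(55769)

    // A late completion arrives for stage attempt 44.0 from lost executor 33.
    val fromLostExecutor = true

    // TaskSetManager side: the partition is marked finished across all
    // attempts, so attempt 44.1 never launches a task for it.
    val taskSetComplete = true

    // DAGScheduler side: the completion is ignored ("possibly bogus"),
    // so the partition is never removed from pendingPartitions.
    if (!fromLostExecutor) pendingPartitions -= 55769

    (pendingPartitions.toSet, taskSetComplete)
  }
}
```

The two sides disagree permanently: the task set is complete, yet pendingPartitions is non-empty, which matches the heap dump described above.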

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:
