spark-dev mailing list archives

From Jacek Laskowski
Subject ShuffleMapStage and pendingPartitions vs isAvailable or findMissingPartitions?
Date Sun, 26 Apr 2020 09:53:09 GMT

I found that ShuffleMapStage has this (apparently superfluous)
pendingPartitions registry [1] for DAGScheduler and the description says:

"  /**
   * Partitions that either haven't yet been computed, or that were computed on an executor
   * that has since been lost, so should be re-computed.  This variable is used by the
   * DAGScheduler to determine when a stage has completed. Task successes in both the active
   * attempt for the stage or in earlier attempts for this stage can cause partition ids to get
   * removed from pendingPartitions. As a result, this variable may be inconsistent with the pending
   * tasks in the TaskSetManager for the active attempt for the stage (the partitions stored here
   * will always be a subset of the partitions that the TaskSetManager thinks are pending).
   */"

I'm curious why pendingPartitions is needed at all, since isAvailable
and findMissingPartitions (using MapOutputTrackerMaster) already know
which partitions are missing and, I think, are even more up to date.
Why is there this extra registry?
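
To make the question concrete, here is a toy Scala sketch of the two views.
It is not Spark's code: ToyMapOutputTracker, ToyShuffleMapStage and
removeOutputsOnExecutor are hypothetical names, and the behaviour on executor
loss is an assumption (the tracker forgets the lost outputs, while the pending
set is only touched on stage submission and task success).

```scala
import scala.collection.mutable

// Toy model only. ToyMapOutputTracker and ToyShuffleMapStage are hypothetical
// stand-ins and do not reproduce Spark's MapOutputTrackerMaster or ShuffleMapStage.
final class ToyMapOutputTracker(numPartitions: Int) {
  // partition id -> executor that holds the map output
  private val outputs = mutable.Map.empty[Int, String]

  def register(partition: Int, executor: String): Unit = {
    outputs(partition) = executor
  }

  // Assumption: losing an executor makes its map outputs missing again.
  def removeOutputsOnExecutor(executor: String): Unit = {
    val lost = outputs.collect { case (p, e) if e == executor => p }.toList
    lost.foreach(p => outputs.remove(p))
  }

  def findMissingPartitions(): Seq[Int] =
    (0 until numPartitions).filterNot(outputs.contains)
}

final class ToyShuffleMapStage(tracker: ToyMapOutputTracker) {
  // Scheduler-side bookkeeping, kept separately from the tracker.
  val pendingPartitions = mutable.HashSet.empty[Int]

  def findMissingPartitions(): Seq[Int] = tracker.findMissingPartitions()
  def isAvailable: Boolean = findMissingPartitions().isEmpty
}

object PendingVsMissing {
  // Returns (pending partitions, missing partitions, isAvailable) after the scenario.
  def run(): (Set[Int], Seq[Int], Boolean) = {
    val tracker = new ToyMapOutputTracker(3)
    val stage = new ToyShuffleMapStage(tracker)

    // Stage submission: every missing partition becomes pending.
    stage.pendingPartitions ++= stage.findMissingPartitions()

    // Partition 0 finishes on exec-1 and partition 1 on exec-2.
    Seq(0 -> "exec-1", 1 -> "exec-2").foreach { case (p, exec) =>
      stage.pendingPartitions -= p
      tracker.register(p, exec)
    }

    // exec-1 is lost: the tracker forgets partition 0, the pending set is unchanged.
    tracker.removeOutputsOnExecutor("exec-1")

    (stage.pendingPartitions.toSet, stage.findMissingPartitions(), stage.isAvailable)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

In this toy scenario the pending set holds only partition 2, while the
tracker-backed view reports partitions 0 and 2 as missing, so the two
registries are not interchangeable under these assumptions. Whether that is
the actual reason for pendingPartitions in DAGScheduler is exactly what I am
asking.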


Jacek Laskowski
"The Internals Of" Online Books <>