spark-dev mailing list archives

From Madhusudanan Kandasamy
Subject Question on DAGScheduler.getMissingParentStages()
Date Tue, 08 Sep 2015 15:00:37 GMT


I'm new to Spark and am trying to understand the DAGScheduler code flow. As
far as I can tell, getMissingParentStages() does redundant work by
recalculating stage dependencies. When the first stage is created, all of its
parent stages are computed recursively and stored in the stage.parents
member. Whenever a stage needs to be submitted, getMissingParentStages() is
called to get the list of its parent stages that have not been computed yet.

I expected getMissingParentStages() to go through stage.parents and check
whether each parent has already been computed. Instead, the function does
another graph traversal starting from stage.rdd, which seems unnecessary. Is
there a specific reason for this design? If not, I would like to rework
getMissingParentStages() so that it avoids the graph traversal.
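
To make the question concrete, the two approaches can be sketched roughly as
follows. This is a toy model in Python rather than Spark's Scala, not the
actual DAGScheduler code: the classes Rdd and Stage, the is_available flag and
both function names are hypothetical, and the sketch ignores anything else the
real scheduler may take into account during its traversal.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Stage:
    """Toy stage: a name, its final RDD, and its direct parent stages."""
    name: str
    rdd: "Rdd" = None
    parents: list = field(default_factory=list)  # filled in at creation time
    is_available: bool = False  # hypothetical flag: output already computed


@dataclass(eq=False)
class Rdd:
    """Toy RDD: narrow parents stay in the same stage, shuffle parents do not."""
    name: str
    narrow_deps: list = field(default_factory=list)   # parent Rdd objects
    shuffle_deps: list = field(default_factory=list)  # parent Stage objects


def missing_parents_by_traversal(stage):
    """Walk the RDD graph from stage.rdd, stopping at shuffle boundaries."""
    missing, visited, waiting = [], set(), [stage.rdd]
    while waiting:
        rdd = waiting.pop()
        if rdd in visited:
            continue
        visited.add(rdd)
        for parent_stage in rdd.shuffle_deps:
            if not parent_stage.is_available and parent_stage not in missing:
                missing.append(parent_stage)
        # Narrow dependencies belong to the same stage, so keep walking.
        waiting.extend(rdd.narrow_deps)
    return missing


def missing_parents_from_cache(stage):
    """Proposed alternative: filter the precomputed stage.parents list."""
    return [p for p in stage.parents if not p.is_available]


# Example: a result stage whose RDD chain has two shuffle parents,
# one of which (map_a) is already computed.
map_a = Stage("map_a", is_available=True)
map_b = Stage("map_b")
joined = Rdd("joined", shuffle_deps=[map_a, map_b])
mapped = Rdd("mapped", narrow_deps=[joined])
result = Stage("result", rdd=mapped, parents=[map_a, map_b])

print([s.name for s in missing_parents_by_traversal(result)])
print([s.name for s in missing_parents_from_cache(result)])
```

In this toy model both functions report only map_b as missing. Whether the
two stay equivalent in the real scheduler is exactly the open question.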
