spark-user mailing list archives

From Stefano Pettini
Subject [Spark Core] Getting the number of stages a job is made of
Date Fri, 23 Mar 2018 11:20:31 GMT
 Hi everybody,

this is my first message to the mailing list. In a world of DataFrames and
Structured Streaming, my use cases may be considered corner cases, but I
still think it's important to address such problems and to understand in
depth how Spark RDDs work.

We have an application that synthesizes complex Spark jobs using the RDD
interface, easily reaching hundreds of RDDs and stages for a single job.
Due to the nature of the functions passed to mapPartitions & co, we need
precise control over the characteristics of the job Spark constructs when
an action is invoked on the last RDD: what is shuffled and what is not,
the number of stages, the fine-tuning of RDD persistence, the partitioning,
and so on.

For example, we have unit tests that check that the number of stages in
the job synthesized for a given use case doesn't increase, so that we
detect unforeseen shuffles, which we would consider a regression.

The problem is that it's difficult to "introspect" the job produced by
Spark. Even counting the number of stages is complex, because there isn't
any simple stageId associated with each RDD. Today we're doing it
indirectly, by registering a listener and monitoring the events when the
action is invoked, but I would prefer doing it in a more direct way, like
checking the properties of an RDD.

So another approach could be to analyze, recursively, the dependencies of
the RDD on which the action is invoked and count the number of
dependencies that are a ShuffleDependency, making sure each parent RDD is
considered only once.
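
To make the idea concrete, here is a minimal sketch of that traversal on a
toy lineage graph, written in plain Python so it runs without Spark. Node,
Dep and count_shuffle_deps are hypothetical names invented for this
illustration: Node.deps stands in for an RDD's dependencies, Dep.parent for
the parent RDD of a dependency, and Dep.shuffle for "this dependency is a
ShuffleDependency". It is not Spark code, only the shape of the algorithm.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Dep:
    """Toy stand-in for a dependency edge (hypothetical, not Spark's class)."""
    parent: "Node"
    shuffle: bool = False  # True models a ShuffleDependency


@dataclass(eq=False)
class Node:
    """Toy stand-in for an RDD in the lineage graph (hypothetical)."""
    name: str
    deps: List[Dep] = field(default_factory=list)


def count_shuffle_deps(root: Node) -> int:
    """Count shuffle edges reachable from root, visiting each node once."""
    seen = set()
    stack = [root]
    shuffles = 0
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue  # already expanded through another path
        seen.add(id(node))
        for dep in node.deps:
            if dep.shuffle:
                shuffles += 1
            stack.append(dep.parent)
    return shuffles


# Diamond-shaped lineage: "source" is reachable through both "a" and "b".
raw = Node("raw")
source = Node("source", [Dep(raw, shuffle=True)])
a = Node("a", [Dep(source)])
b = Node("b", [Dep(source)])
joined = Node("joined", [Dep(a, shuffle=True), Dep(b, shuffle=True)])
result = Node("result", [Dep(joined)])

print(count_shuffle_deps(result))  # 3: the shuffle below "source" counts once
```

Under the assumption that each shuffle dependency yields one map stage and
the action adds one final stage, the estimate here would be 3 + 1 = 4
stages. Stages that are skipped at run time because their shuffle output
already exists would still be counted, so the result may differ from what
the listener observes.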

Does it make sense? Is it a reliable method?

