spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Imran Rashid <>
Subject Re: Spark Scheduler - Task and job levels - How it Works?
Date Tue, 08 Jan 2019 18:29:53 GMT
Hi Miguel,

On Sun, Jan 6, 2019 at 11:35 AM Miguel F. S. Vasconcelos <> wrote:

> When an action is performed onto a RDD, Spark send it as a job to the
> DAGScheduler;
> The DAGScheduler compute the execution DAG based on the RDD's lineage, and
> split the job into stages (using wide dependencies);
> The resulting stages are transformed into a set of tasks, that are sent to
> the TaskScheduler;
> The TaskScheduler send the set of tasks to the executors, where they will
> run.
> Is this flow correct?

yes, more or less, though that's really just the beginning.  Then there is
an endless back-and-forth as executors complete tasks, send info back to
the driver, the driver updates some state, perhaps just launches more tasks
in the existing tasksets, or creates more, or finishes jobs, etc.

And are the jobs  discovered during the application execution and sent
> sequentially to the DAGScheduler?

yes, there are very specific apis to create a job -- in the user guide,
these are called "actions".  Most of these are blocking, eg. when the user
calls rdd.count(), a job is created, potentially with a very long lineage
and many stages, and only when the entire job is completed does rdd.count()
complete.  There are a few async versions (eg. countAsync()), but from what
I've seen, the more common way to submit multiple concurrent jobs is to use
the regular apis from multiple threads.

> Regarding this part /"finds a minimal schedule to run the job"/, I have not
> found this algorithm for getting the minimal schedule. Can you help me?

I think its just using narrow dependencies, and re-using existing cached
data & shuffle data whenever possible.  that's implicit in the DAGScheduler
& RDD code.

> I'm in doubt if the scheduling is at Task level, job level, or both. These
> scheduling modes: FIFO and FAIR, are for tasks or jobs?

FIFO and FAIR only matter if you've got multiple concurrent jobs.  But then
it controls how you schedule tasks *within* those jobs.  When a job is
submitted, the DAGScheduler will still compute the DAG and the next taskset
to run for that job, even if the entire cluster is busy.  But then as
resources free up, it needs to decide which job to submit tasks from.  Note
you might have 2 active jobs, with 5 stages ready to submit tasks, and
another 20 stages still waiting for their dependencies to be computed,
further down the pipeline for those jobs.

> Also, as the TaskScheduler is an interface, is possible to "plug" different
> scheduling algorithms to it, correct?

yes, though spark itself only has one implementation (I believe Facebook
has their own, not sure of others?).  There is a ExternalClusterManager api
to let users plug in their own.

> But what about the DAGScheduler, is there any interface that allows
> plugging
> different scheduling algorithms to it?

there is no interface currently.

In the video "Introduction to AmpLab Spark Internals" its said that
> pluggable inter job scheduling is a possible future extension. Anyone knows
> if this has already been addressed ?

don't think so.

> I'm starting a master degree and I'd really like to contribute to Spark.
> Are
> there suggestions of issues in the spark scheduling that could be
> addressed??

there is lots to do, but this is tough to answer.  Depends on your
interests in particular.  Also, to be honest, the work that needs to be
done often doesn't align well with the kind of work you need to do for a
research project.  For example, adding existing features into cluster
managers (eg. kubernetes), or adding tests & chasing down concurrency bugs
might not interest a student.  OTOH, if you create an entirely different
implementation of the DAGScheduler and do some tests on its properties
under various loads -- that would be really interesting, but its also
unlikely to get accepted given the quality of code that normal comes from
research projects, and without finding folks in the community that
understand it well and are ready to maintain it.

You can search for jiras with component "Scheduler":

there was some high level discussion a while back about a set of changes we
might consider making to scheduler, particularly for dealing w/ failures on
large clusters, but this never really picked up steam.  Might be
interesting for a research project:

View raw message