spark-dev mailing list archives

From Tom Graves <>
Subject Re: planning & discussion for larger scheduler changes
Date Fri, 31 Mar 2017 13:47:29 GMT
filed [SPARK-20178] Improve Scheduler fetch failures - ASF JIRA

    On Thursday, March 30, 2017 1:21 PM, Tom Graves <> wrote:

 If we are worried about major changes destabilizing current code (which I can understand), the
only way around that is to make it pluggable or configurable.  For major changes, making it
pluggable seems cleaner in terms of code clutter, but it also means you may have to make the
same or similar change in 2 places.  We could make the interfaces better defined, but if the
major changes would require interface changes, that doesn't help.
 It still seems like if we had a list of things we would like to accomplish and a rough idea
of the overall design, we could see whether defining the interfaces better or making them
pluggable would help.
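
As a rough illustration of the "pluggable" option discussed above, here is a minimal sketch in Python. It is not Spark code: the `ResubmitPolicy` interface, both implementations, and the `example.scheduler.resubmitPolicy` key are hypothetical names invented for this example. It only shows how a config value could select between an existing behavior and an experimental one behind a single small interface.

```python
from abc import ABC, abstractmethod


class ResubmitPolicy(ABC):
    """Hypothetical plug-in point: after a fetch failure, decide which
    unfinished tasks of the stage should be submitted again."""

    @abstractmethod
    def tasks_to_resubmit(self, unfinished, running):
        """Return the set of task ids to resubmit."""


class ResubmitAllUnfinished(ResubmitPolicy):
    """Conservative choice: resubmit every task that has not finished."""

    def tasks_to_resubmit(self, unfinished, running):
        return set(unfinished)


class SkipStillRunning(ResubmitPolicy):
    """Experimental choice: leave tasks that are still running alone."""

    def tasks_to_resubmit(self, unfinished, running):
        return set(unfinished) - set(running)


# Registry keyed by a made-up config value (an assumption, not a Spark setting).
POLICIES = {
    "resubmit-all": ResubmitAllUnfinished,
    "skip-running": SkipStillRunning,
}


def load_policy(conf):
    """Pick the policy named in the config, defaulting to the conservative one."""
    name = conf.get("example.scheduler.resubmitPolicy", "resubmit-all")
    return POLICIES[name]()
```

The trade-off raised in the message is visible even here: the default path stays untouched, but any change to the `tasks_to_resubmit` signature has to be made in every implementation.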
There seem to be 3 jiras all related to handling fetch failures:,
 SPARK-14649, and SPARK-19753.  It might be nice to create one epic jira where we think
about the design as a whole and discuss that more.  Any objections to this?  If not, I'll create
an epic and link the others to it.

    On Monday, March 27, 2017 9:01 PM, Kay Ousterhout <> wrote:

 (1) I'm pretty hesitant to merge these larger changes, even if they're feature flagged, because:
 (a) For some of these changes, it's not obvious that they'll always improve performance.
e.g., for SPARK-14649, it's possible that the tasks that got re-started (and temporarily are
running in two places) are going to fail in the first attempt (because they haven't read the
missing map output yet).  In that case, not re-starting them will lead to worse performance.
 (b) The scheduler already has some secret flags that aren't documented and are used by only
a few people.  I'd like to avoid adding more of these (e.g., by merging these features, but
having them off by default), because very few users use them (since it's hard to learn about
them), they add complexity to the scheduler that we have to maintain, and for users who are
considering using them, they often hide advanced behavior that's hard to reason about anyway
(e.g., the point above for SPARK-14649).
 (c) The worst performance problem is when jobs
just hang or crash; we've seen a few cases of that in recent bugs, and I'm worried that merging
these complex performance improvements trades better performance in a small number of cases
for the possibility of worse performance via job crashes/hangs in other cases.
Roughly, I think our standard for merging performance fixes to the scheduler should be that
the performance improvement either (a) is simple / easy to reason about or (b) unambiguously
fixes a serious performance problem.  SPARK-14649, for example, is complex, and it improves
performance in some cases but hurts it in others, so it doesn't fit either (a) or (b).
(2) I do think there are some scheduler refactorings that would improve testability and our
ability to reason about correctness, but I think there are somewhat surgical, smaller things
we could do in the vein of Imran's comment about reducing shared state.  Right now we have
these super wide interfaces between different components of the scheduler, and it means you
have to reason about the TSM (TaskSetManager), TSI (TaskSchedulerImpl), CGSB
(CoarseGrainedSchedulerBackend), and DAGScheduler to figure out whether something works.
I think we could have an effort to make each component have a much narrower interface, so
that each part hides a bunch of complexity from other components.  The most obvious place
to do this in the short term is to remove a bunch of info tracking from the DAGScheduler;
I filed a JIRA for that here.  I suspect there are similar things that could be done in
other parts of the scheduler.
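
To make the "narrower interface" idea concrete, here is a minimal sketch in Python. It is an illustration only, not Spark code: `TaskSetTracker`, its methods, and `stage_done` are hypothetical names. The tracker keeps its bookkeeping private, and the caller can ask only one question of it.

```python
class TaskSetTracker:
    """Hypothetical component that owns all per-task bookkeeping privately."""

    def __init__(self, task_ids):
        self._pending = set(task_ids)
        self._running = {}  # task id -> executor id
        self._finished = set()

    def launch(self, task_id, executor_id):
        """Move a pending task to running on the given executor."""
        self._pending.remove(task_id)
        self._running[task_id] = executor_id

    def task_finished(self, task_id):
        """Record a successful completion of a running task."""
        del self._running[task_id]
        self._finished.add(task_id)

    def executor_lost(self, executor_id):
        """Put every task that was running on the lost executor back in pending."""
        lost = [t for t, e in self._running.items() if e == executor_id]
        for task_id in lost:
            del self._running[task_id]
            self._pending.add(task_id)

    def is_complete(self):
        """The only question a higher-level component needs answered."""
        return not self._pending and not self._running


def stage_done(trackers):
    """Higher-level logic depends on is_complete() and nothing else."""
    return all(t.is_complete() for t in trackers)
```

Because `stage_done` never reads the tracker's internal sets, the tracker can be tested, and its internals changed, without reasoning about the caller.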
Tom's comments re: (2) are more about performance improvements than readability / testability
/ debuggability, but they also seem important, and it does seem useful to have a JIRA tracking those.
On Mon, Mar 27, 2017 at 11:06 AM, Tom Graves <> wrote:

1) I think this depends on the individual jira, case by case.  I haven't looked in detail, but
SPARK-14649 seems much larger, although more the way I think we want to go, while SPARK-13669
seems less risky and easily configurable.
2) I don't know whether it needs an entire rewrite but I think there need to be some major
changes made especially in the handling of reduces and fetch failures.  We could do a much
better job of not throwing away existing work and more optimally handling the failure cases. 
For this, would it make sense for us to start with a jira that has a bullet list of things
we would like to improve, to get a more cohesive view and see how invasive it would really be?

    On Friday, March 24, 2017 10:41 AM, Imran Rashid <> wrote:

 Kay and I were discussing some of the bigger scheduler changes getting proposed lately,
and realized there is a broader discussion to have with the community, outside of any single
jira.  I'll start by sharing my initial thoughts.  I know Kay has thoughts on this too, but
it would be good to get input from everyone.
In particular, SPARK-14649 & SPARK-13669 have got me thinking.  These are proposed changes
in behavior that are not fixes for *correctness* in fault tolerance, but to improve the performance
when there are faults.  The changes make some intuitive sense, but it's also hard to judge whether
they are necessarily better; it's hard to verify the correctness of the changes; and it's hard
to even know that we haven't broken the old behavior (because of how brittle the scheduler
seems to be).
So I'm wondering:
1) in the short-term, can we find ways to get these changes merged, but turned off by default,
in a way that we feel confident won't break existing code?
2) a bit longer-term -- should we be considering bigger rewrites to the scheduler?  Particularly,
to improve testability?  e.g., maybe if it was rewritten to more completely follow the actor
model and eliminate shared state, the code would be cleaner and more testable.  Or maybe
this is a crazy idea, and we'd just lose everything we'd learned so far and be stuck fixing
just as many bugs in the new version.
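
For readers unfamiliar with the actor-model idea mentioned in (2), here is a minimal single-threaded sketch in Python. It is an illustration only, not a proposal for Spark's design: `SchedulerActor`, the message names, and the `outbox` list are all hypothetical. The point is that state is touched only while handling one message at a time, so a test can feed in a message sequence and check the result deterministically.

```python
from collections import deque


class SchedulerActor:
    """Hypothetical actor: private state, changed only by handling messages."""

    def __init__(self):
        self._mailbox = deque()
        self._pending = set()
        self._completed = set()
        self.outbox = []  # messages this actor would send to other actors

    def send(self, kind, task_id):
        """Enqueue a message; no state is changed here."""
        self._mailbox.append((kind, task_id))

    def run_until_idle(self):
        """Handle queued messages one at a time, in arrival order."""
        while self._mailbox:
            kind, task_id = self._mailbox.popleft()
            self._handle(kind, task_id)

    def _handle(self, kind, task_id):
        if kind == "submit":
            self._pending.add(task_id)
            self.outbox.append(("launch", task_id))
        elif kind == "success":
            self._pending.discard(task_id)
            self._completed.add(task_id)
        elif kind == "fetch_failed":
            # Relaunch only if the task has not already completed.
            if task_id in self._pending:
                self.outbox.append(("launch", task_id))
        else:
            raise ValueError(f"unknown message kind: {kind}")

    def snapshot(self):
        """Read-only copy of the state, for assertions in tests."""
        return frozenset(self._pending), frozenset(self._completed)
```

A test then reduces to sending a fixed sequence of messages, calling `run_until_idle()`, and comparing `outbox` and `snapshot()` against expected values, with no locks or timing involved.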


