spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tom Graves <>
Subject Re: planning & discussion for larger scheduler changes
Date Mon, 27 Mar 2017 18:06:28 GMT
1) I think this depends on individual case by case jira.  I haven't looked in detail at spark-14649
seems much larger although more the way I think we want to go. While SPARK-13669 seems less
risky and easily configurable.
2) I don't know whether it needs an entire rewrite but I think there need to be some major
changes made especially in the handling of reduces and fetch failures.  We could do a much
better job of not throwing away existing work and more optimally handling the failure cases.
 For this would it make sense for us to start with a jira that has a bullet list of things
we would like to improve and get more cohesive view and see really how invasive it would be?

    On Friday, March 24, 2017 10:41 AM, Imran Rashid <> wrote:

 Kay and I were discussing some of the  bigger scheduler changes getting proposed lately,
and realized there is a broader discussion to have with the community, outside of any single
jira.  I'll start by sharing my initial thoughts, I know Kay has thoughts on this too, but
it would be good to input from everyone.
In particular, SPARK-14649 & SPARK-13669 have got me thinking.  These are proposed changes
in behavior that are not fixes for *correctness* in fault tolerance, but to improve the performance
when there faults.  The changes make some intuitive sense, but its also hard to judge whether
they are necessarily better; its hard to verify the correctness of the changes; and its hard
to even know that we haven't broken the old behavior (because of how brittle the scheduler
seems to be).
So I'm wondering:
1) in the short-term, can we find ways to get these changes merged, but turned off by default,
in a way that we feel confident won't break existing code?
2) a bit longer-term -- should we be considering bigger rewrites to the scheduler?  Particularly,
to improve testability?  eg., maybe if it was rewritten to more completely follow the actor
model and eliminate shared state, the code would be cleaner and more testable.  Or maybe
this is a crazy idea, and we'd just lose everything we'd learned so far and be stuck fixing
the as many bugs in the new version.

View raw message