spark-dev mailing list archives

From: Imran Rashid
Subject: Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling
Date: Tue, 26 Mar 2019 17:05:08 GMT

+1 on the updated SPIP

I agree with all of Mark's concerns, that eventually we want some way for
users to express per-task constraints -- but I feel like this is still a
reasonable step forward.

In the meantime, users will either write small spark applications, which
just do the steps which need gpus, and then run separate spark applications
which don't, with something external to orchestrate that pipeline; or
they'll run one giant application, which utilizes resources really poorly.
After we have task-specific constraints, both will still work, but there
would be motivation to tune the giant application.  And plenty of users
might still want to write small spark applications that use gpus, and this
would continue to help them out, without having to worry about the
complexity of per-task constraints.

On Tue, Mar 26, 2019 at 12:33 AM Xingbo Jiang wrote:

> +1 on the updated SPIP
> On Tue, Mar 26, 2019 at 1:32 PM Xingbo Jiang wrote:
>> Hi all,
>> Now that we have had a few discussions about the updated SPIP, we have
>> also updated it to address new feedback from some committers. IMO the
>> SPIP is ready for another round of voting now.
>> On the updated SPIP, we currently have two +1s (from Tom and Xiangrui);
>> everyone else, please vote again.
>> The vote will be open for the next 72 hours.
>> Thanks!
>> Xingbo
>> On Tue, Mar 26, 2019 at 11:32 AM Xiangrui Meng wrote:
>>> On Mon, Mar 25, 2019 at 8:07 PM Mark Hamstra wrote:
>>>> Maybe.
>>>> And I expect that we will end up doing something based on
>>>> spark.task.cpus in the short term. I'd just rather that this SPIP not make
>>>> it look like this is the way things should ideally be done. I'd prefer that
>>>> we be quite explicit in recognizing that this approach is a significant
>>>> compromise, and I'd like to see at least some references to the beginning
>>>> of serious longer-term efforts to do something better in a deeper re-design
>>>> of resource scheduling.
>>> It is also a feature I desire as a user. How about suggesting it as
>>> future work in the SPIP? It certainly requires someone who fully
>>> understands the Spark scheduler to drive. Shall we start with a Spark
>>> JIRA? I don't know as much about the scheduler as you do, but I can
>>> speak for DL use cases. Maybe we just view it from different angles. To
>>> you, an application-level request is a significant compromise. To me, it
>>> is a major milestone that brings GPUs to Spark workloads. I know many
>>> users who tried to do DL on Spark and ended up doing hacks here and
>>> there, a huge pain. The scope covered by the current SPIP makes those
>>> users much happier. Tom and Andy from NVIDIA are certainly better
>>> calibrated on the usefulness of the current proposal.
>>>> On Mon, Mar 25, 2019 at 7:39 PM Xiangrui Meng wrote:
>>>>> There are certainly use cases where different stages require different
>>>>> numbers of CPUs or GPUs under an optimal setting. I don't think anyone
>>>>> disagrees that ideally users should be able to do it. We are just dealing
>>>>> with typical engineering trade-offs and seeing how we can break it down
>>>>> into smaller ones. I think it is fair to treat the task-level resource
>>>>> request as a separate feature here because it also applies to CPUs alone
>>>>> without GPUs, as Tom mentioned above. But with only "spark.task.cpus" for
>>>>> many years, Spark has still been able to cover many, many use cases.
>>>>> Otherwise we wouldn't see so many Spark users around now. Here we just
>>>>> apply similar arguments to GPUs.
>>>>> Initially, I was the person who really wanted task-level requests
>>>>> because it is ideal. In an offline discussion, Andy Feng pointed out that
>>>>> an application-level setting should fit common deep learning training and
>>>>> inference cases and that it greatly simplifies the changes required to
>>>>> the Spark job scheduler. With Imran's feedback on the initial design
>>>>> sketch, the application-level approach became my first choice because it
>>>>> is still very valuable but much less risky. If a feature brings great
>>>>> value to users, we should add it even if it is not ideal.
>>>>> Back to the default value discussion, let's forget GPUs and only
>>>>> consider CPUs. Would an application-level default number of CPU cores
>>>>> disappear if we added task-level requests? If yes, does it mean that
>>>>> users have to explicitly state the resource requirements for every single
>>>>> stage? It is tedious to do, and users who do not fully understand the
>>>>> impact would probably do it wrong and waste even more resources. Then how
>>>>> many cores should each task use if the user didn't specify it? I do see
>>>>> "spark.task.cpus" as the answer here. The point I want to make is that
>>>>> "spark.task.cpus", though less ideal, is still needed when we have
>>>>> task-level requests for CPUs.
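
The fallback described in the paragraph above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions:
resolve_task_cpus and the per-stage override argument are hypothetical
names, not Spark APIs. Only the "spark.task.cpus" key and its default of 1
come from Spark itself.

```python
# Hypothetical sketch: a per-stage CPU request falls back to the
# application-level default when the user did not specify one.
DEFAULT_TASK_CPUS = 1  # Spark's documented default for "spark.task.cpus"


def resolve_task_cpus(app_conf, stage_request=None):
    """Return the number of CPU cores a task in this stage should use.

    app_conf      -- dict of application-level settings (string or int values)
    stage_request -- hypothetical per-stage override, or None if unset
    """
    if stage_request is not None:
        # An explicit per-stage request wins over the application default.
        return int(stage_request)
    # Otherwise fall back to the application-level setting, then to the default.
    return int(app_conf.get("spark.task.cpus", DEFAULT_TASK_CPUS))


conf = {"spark.task.cpus": "2"}
print(resolve_task_cpus(conf))       # stage did not ask: application default
print(resolve_task_cpus(conf, 4))    # stage asked for 4 cores explicitly
```

With this shape, stages that never state a requirement keep working
unchanged, which is the argument for keeping an application-level default
even if task-level requests are added later.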
>>>>> On Mon, Mar 25, 2019 at 6:46 PM Mark Hamstra wrote:
>>>>>> I remain unconvinced that a default configuration at the application
>>>>>> level makes sense even in that case. There may be some applications
>>>>>> where you know a priori that almost all the tasks for all the stages
>>>>>> for all the jobs will need some fixed number of gpus; but I think the
>>>>>> more common case will be dynamic configuration at the job or stage
>>>>>> level. Stage level would have a lot of overlap with barrier mode
>>>>>> scheduling -- barrier mode stages having a need for an inter-task
>>>>>> channel resource, gpu-ified stages needing gpu resources, etc. Have I
>>>>>> mentioned that I'm not a fan of the current barrier mode API,
>>>>>> Xiangrui? :) Yes, I know: "Show me something better."
>>>>>> On Mon, Mar 25, 2019 at 3:55 PM Xiangrui Meng wrote:
>>>>>>> Say we support per-task resource requests in the future; it would
>>>>>>> still be inconvenient for users to declare the resource requirements
>>>>>>> for every single task/stage. So there must be some default values
>>>>>>> somewhere for task resource requirements. "spark.task.cpus" and
>>>>>>> "spark.task.accelerator.gpu.count" could serve this purpose without
>>>>>>> introducing breaking changes. So I'm +1 on the updated SPIP. It fairly
>>>>>>> separated necessary GPU support from risky scheduler changes.
>>>>>>> On Mon, Mar 25, 2019 at 8:39 AM Mark Hamstra wrote:
>>>>>>>> Of course there is an issue of the perfect becoming the enemy of
>>>>>>>> the good, so I can understand the impulse to get something done. I
>>>>>>>> am left wanting, however, at least something more of a roadmap to a
>>>>>>>> task-level future than just a vague "we may choose to do something
>>>>>>>> more in the future." At the risk of repeating myself, I don't think
>>>>>>>> the existing spark.task.cpus is very good, and I think that building
>>>>>>>> more on that weak foundation without a clearer path or stated
>>>>>>>> intention to move to something better runs the risk of leaving Spark
>>>>>>>> stuck in a bad neighborhood.
>>>>>>>> On Thu, Mar 21, 2019 at 10:10 AM Tom Graves wrote:
>>>>>>>>> While I agree with you that it would be ideal to have the task
>>>>>>>>> level resources and do a deeper redesign for the scheduler, I think
>>>>>>>>> that can be a separate enhancement, as was discussed earlier in the
>>>>>>>>> thread. That feature is useful without GPUs. I do realize that they
>>>>>>>>> overlap some, but I think the changes for this will be minimal to
>>>>>>>>> the scheduler, follow existing conventions, and it is an improvement
>>>>>>>>> over what we have now. I know many users will be happy to have this
>>>>>>>>> even without the task level scheduling, as many of the conventions
>>>>>>>>> used now to schedule gpus can easily be broken by one bad user. I
>>>>>>>>> think from the user point of view this gives many users an
>>>>>>>>> improvement and we can extend it later to cover more use cases.
>>>>>>>>> Tom
>>>>>>>>> On Thursday, March 21, 2019, 9:15:05 AM PDT, Mark Hamstra wrote:
>>>>>>>>> I understand the application-level, static, global nature of
>>>>>>>>> spark.task.accelerator.gpu.count and its similarity to the existing
>>>>>>>>> spark.task.cpus, but to me this feels like extending a weakness of
>>>>>>>>> Spark's scheduler, not building on its strengths. That is because I
>>>>>>>>> consider binding the number of cores for each task to an application
>>>>>>>>> configuration to be far from optimal. This is already far from the
>>>>>>>>> desired behavior when an application is running a wide range of jobs
>>>>>>>>> (as in a generic job-runner style of Spark application), some of
>>>>>>>>> which require or can benefit from multi-core tasks, others of which
>>>>>>>>> will just waste the extra cores allocated to their tasks. Ideally,
>>>>>>>>> the number of cores allocated to tasks would get pushed to an even
>>>>>>>>> finer granularity than jobs, instead being a per-stage property.
>>>>>>>>> Now, of course, making allocation of general-purpose cores and
>>>>>>>>> domain-specific resources work in this finer-grained fashion is a
>>>>>>>>> lot more work than just trying to extend the existing resource
>>>>>>>>> allocation mechanisms to handle domain-specific resources, but it
>>>>>>>>> does feel to me like we should at least be considering doing that
>>>>>>>>> deeper redesign.
>>>>>>>>> On Thu, Mar 21, 2019 at 7:33 AM Tom Graves wrote:
>>>>>>>>> The proposal here is that all your resources are static and the gpu
>>>>>>>>> per task config is global per application, meaning you ask for a
>>>>>>>>> certain amount of memory, cpu, and GPUs for every executor up front,
>>>>>>>>> just like you do today, and every executor you get is that size.
>>>>>>>>> This means that both static and dynamic allocation still work
>>>>>>>>> without explicitly adding more logic at this point. Since the config
>>>>>>>>> for gpu per task is global, it means every task you want will need a
>>>>>>>>> certain ratio of cpu to gpu. Since that is global, you can't really
>>>>>>>>> have the scenario you mentioned; all tasks are assumed to need GPUs.
>>>>>>>>> For instance, I request 5 cores, 2 GPUs, and set 1 gpu per task for
>>>>>>>>> each executor. That means that I could only run 2 tasks and 3 cores
>>>>>>>>> would be wasted. The stage/task level configuration of resources was
>>>>>>>>> removed and is something we can do in a separate SPIP.
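
The arithmetic in the 5-core, 2-GPU example can be checked with a short
Python sketch. It is not scheduler code: executor_slots is a hypothetical
helper, and it assumes 1 CPU core per task (the spark.task.cpus default),
which the example implies but does not state.

```python
def executor_slots(executor_cores, executor_gpus, task_cpus, task_gpus):
    """Return (concurrent_tasks, idle_cores, idle_gpus) for one executor.

    Every task is assumed to need the same fixed amount of each resource,
    as with an application-level, global per-task setting.
    """
    cpu_slots = executor_cores // task_cpus
    # If tasks need no GPUs, GPUs never limit concurrency.
    gpu_slots = executor_gpus // task_gpus if task_gpus > 0 else cpu_slots
    # The scarcer resource bounds how many tasks can run at once.
    tasks = min(cpu_slots, gpu_slots)
    idle_cores = executor_cores - tasks * task_cpus
    idle_gpus = executor_gpus - tasks * task_gpus
    return tasks, idle_cores, idle_gpus


# 5 cores and 2 GPUs per executor, 1 CPU (assumed) and 1 GPU per task.
print(executor_slots(5, 2, 1, 1))  # (2, 3, 0): 2 tasks, 3 idle cores
```

The GPU count is the binding constraint here, so the three extra cores can
never be used by any task in the application.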
>>>>>>>>> We thought erroring would make it more obvious to the user. We could
>>>>>>>>> change this to a warning if everyone thinks that is better, but I
>>>>>>>>> personally like the error until we can implement the lower-level
>>>>>>>>> per-stage configuration.
>>>>>>>>> Tom
>>>>>>>>> On Thursday, March 21, 2019, 1:45:01 AM PDT, Marco Gaido wrote:
>>>>>>>>> Thanks for this SPIP.
>>>>>>>>> I cannot comment on the docs, but just wanted to highlight one
>>>>>>>>> thing. On page 5 of the SPIP, when we talk about DRA, I see:
>>>>>>>>> "For instance, if each executor consists 4 CPUs and 2 GPUs, and each
>>>>>>>>> task requires 1 CPU and 1GPU, then we shall throw an error on
>>>>>>>>> application start because we shall always have at least 2 idle CPUs
>>>>>>>>> per executor"
>>>>>>>>> I am not sure this is the correct behavior. We might have tasks
>>>>>>>>> requiring only CPU running in parallel as well, hence that may make
>>>>>>>>> sense. I'd rather emit a WARN or something similar. Anyway, we just
>>>>>>>>> said we will keep GPU scheduling on task level out of scope for the
>>>>>>>>> moment, right?
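
The two behaviors under discussion, failing at application start versus
only warning, can be contrasted with a small Python sketch. Everything
here is hypothetical: check_executor_shape and its strict flag are
illustrative names, not part of the SPIP or of Spark, and each task is
assumed to need at least one GPU.

```python
import warnings


def check_executor_shape(executor_cores, executor_gpus,
                         task_cpus, task_gpus, strict=True):
    """Return the number of CPU cores per executor that can never be used.

    With strict=True a wasteful shape raises ValueError (the behavior the
    SPIP text describes); with strict=False it only emits a warning (the
    alternative suggested in this message).
    """
    cpu_slots = executor_cores // task_cpus
    gpu_slots = executor_gpus // task_gpus  # assumes task_gpus >= 1
    tasks = min(cpu_slots, gpu_slots)
    idle_cores = executor_cores - tasks * task_cpus
    if idle_cores > 0:
        message = (f"{idle_cores} of {executor_cores} executor cores will "
                   f"always be idle with {task_cpus} CPU(s) and "
                   f"{task_gpus} GPU(s) per task")
        if strict:
            raise ValueError(message)
        warnings.warn(message)
    return idle_cores


# The example quoted from the SPIP: 4 CPUs and 2 GPUs per executor,
# 1 CPU and 1 GPU per task. Two cores are always idle.
print(check_executor_shape(4, 2, 1, 1, strict=False))  # warns, prints 2
```

Calling the same function with strict=True raises instead of warning,
which makes the wasted cores impossible to miss at the cost of rejecting
a configuration that would otherwise run.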
>>>>>>>>> Thanks,
>>>>>>>>> Marco
>>>>>>>>> On Thu, Mar 21, 2019 at 1:26 AM Xiangrui Meng wrote:
>>>>>>>>> Steve, the initial work would focus on GPUs, but we will keep the
>>>>>>>>> interfaces general to support other accelerators in the future. This
>>>>>>>>> was mentioned in the SPIP and draft design.
>>>>>>>>> Imran, you should have comment permission now. Thanks for making a
>>>>>>>>> pass! I don't think the proposed 3.0 features should block Spark 3.0
>>>>>>>>> release either. It is just an estimate of what we could deliver. I
>>>>>>>>> will update the doc to make it clear.
>>>>>>>>> Felix, it would be great if you can review the updated docs and let
>>>>>>>>> us know your feedback.
>>>>>>>>> ** How about setting a tentative vote closing time to next Tue
>>>>>>>>> (Mar 26)?
>>>>>>>>> On Wed, Mar 20, 2019 at 11:01 AM Imran Rashid wrote:
>>>>>>>>> Thanks for sending the updated docs. Can you please give everyone
>>>>>>>>> the ability to comment? I have some comments, but overall I think
>>>>>>>>> this is a good proposal and addresses my prior concerns.
>>>>>>>>> My only real concern is that I notice some mention of "must dos" for
>>>>>>>>> spark 3.0. I don't want to make any commitment to holding spark 3.0
>>>>>>>>> for parts of this, I think that is an entirely separate decision.
>>>>>>>>> However I'm guessing this is just a minor wording issue, and you
>>>>>>>>> really mean that's a minimal set of features you are aiming for,
>>>>>>>>> which is
>>>>>>>>> On Mon, Mar 18, 2019 at 12:56 PM Xingbo Jiang wrote:
>>>>>>>>> Hi all,
>>>>>>>>> I updated the SPIP doc and stories. I hope it now contains a clear
>>>>>>>>> scope of the changes and enough details for the SPIP vote.
>>>>>>>>> Please review the updated docs, thanks!
>>>>>>>>> On Wed, Mar 6, 2019, Xiangrui Meng wrote:
>>>>>>>>> How about letting Xingbo make a major revision to the SPIP doc to
>>>>>>>>> make it clear what is proposed? I like Felix's suggestion to switch
>>>>>>>>> to the new Heilmeier template, which helps clarify what is proposed
>>>>>>>>> and what is not. Then let's review the new SPIP and resume the vote.
>>>>>>>>> On Tue, Mar 5, 2019 at 7:54 AM Imran Rashid wrote:
>>>>>>>>> OK, I suppose then we are getting bogged down in what a vote on an
>>>>>>>>> SPIP means anyway, which I guess we can set aside for now. With the
>>>>>>>>> level of detail in this proposal, I feel like there is a reasonable
>>>>>>>>> chance I'd still -1 the design or implementation.
>>>>>>>>> And the other thing you're implicitly asking the community for is to
>>>>>>>>> prioritize this feature for continued review and maintenance. There
>>>>>>>>> is already work to be done in things like making barrier mode
>>>>>>>>> support dynamic allocation (SPARK-24942), bugs in failure handling
>>>>>>>>> (e.g. SPARK-25250), and general efficiency of failure handling (e.g.
>>>>>>>>> SPARK-25341, SPARK-20178). I'm very concerned about getting spread
>>>>>>>>> too thin.
>>>>>>>>> But if this is really just a vote on (1) is better gpu support
>>>>>>>>> important for spark, in some form, in some release? and (2) is it
>>>>>>>>> *possible* to do this in a safe way? then I will vote
>>>>>>>>> On Tue, Mar 5, 2019 at 8:25 AM Tom Graves wrote:
>>>>>>>>> So to me most of the questions here are implementation/design
>>>>>>>>> questions. I've had this issue in the past with SPIPs, where I
>>>>>>>>> expected to have more high level design details but was basically
>>>>>>>>> told that belongs in the design jira follow-on. This makes me think
>>>>>>>>> we need to revisit what a SPIP really needs to contain, which should
>>>>>>>>> be done in a separate thread. Note personally I would be for having
>>>>>>>>> more high level details in it.
>>>>>>>>> But the way I read our documentation on a SPIP right now, that
>>>>>>>>> detail is all optional. Now maybe we could argue it's based on what
>>>>>>>>> reviewers request, but really perhaps we should make the wording of
>>>>>>>>> that more required. Thoughts? We should probably separate that
>>>>>>>>> discussion if people want to talk about that.
>>>>>>>>> For this SPIP in particular, the reason I +1 it is because it came
>>>>>>>>> down to 2 questions:
>>>>>>>>> 1) do I think spark should support this -> my answer is yes, I think
>>>>>>>>> this would improve spark; users have been requesting both better GPU
>>>>>>>>> support and support for controlling container requests at a finer
>>>>>>>>> granularity for a while. If spark doesn't support this then users
>>>>>>>>> may go to something else, so I think we should support it
>>>>>>>>> 2) do I think it's possible to design and implement it without
>>>>>>>>> causing large instabilities? My opinion here again is yes. I agree
>>>>>>>>> with Imran and others that the scheduler piece needs to be looked at
>>>>>>>>> very closely, as we have had a lot of issues there, and that is why
>>>>>>>>> I was asking for more details in the design jira:
>>>>>>>>> But I do believe it's possible to do.
>>>>>>>>> If others have reservations on similar questions then I think we
>>>>>>>>> should resolve here or take the discussion of what a SPIP is to a
>>>>>>>>> different thread and then come back to this, thoughts?
>>>>>>>>> Note there is a high level design for at least the core
>>>>>>>>> which is what people seem concerned with, already so including it in
>>>>>>>>> the SPIP should be straightforward.
>>>>>>>>> Tom
>>>>>>>>> On Monday, March 4, 2019, 2:52:43 PM CST, Imran Rashid wrote:
>>>>>>>>> On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng wrote:
>>>>>>>>> On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung wrote:
>>>>>>>>> IMO upfront allocation is less useful. Specifically too
>>>>>>>>> for large jobs.
>>>>>>>>> This is also an API/design discussion.
>>>>>>>>> I agree with Felix -- this is more than just an API question. It has
>>>>>>>>> a huge impact on the complexity of what you're proposing. You might
>>>>>>>>> be proposing big changes to a core and brittle part of spark, which
>>>>>>>>> is already short of experts.
>>>>>>>>> I don't see any value in having a vote on "does feature X sound
>>>>>>>>> cool?" We have to evaluate the potential benefit against the risks
>>>>>>>>> the feature brings and the continued maintenance cost. We don't need
>>>>>>>>> super low-level details, but we have to have a sketch of the design
>>>>>>>>> to be able to make that tradeoff.
