spark-issues mailing list archives

From "Adam Kennedy (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-25889) Dynamic allocation load-aware ramp up
Date Fri, 22 Feb 2019 00:11:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-25889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Kennedy updated SPARK-25889:
---------------------------------
    Description: 
The time-based exponential ramp-up behavior for dynamic allocation is naive and destructive,
making it very difficult to run very large jobs.

On a large (64,000-core) YARN cluster with a high number of input partitions (200,000+),
the default dynamic allocation approach of requesting containers in waves, doubling exponentially
once per second, results in 50% of the entire cluster being requested in the final 1-second
wave.
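
As a rough illustration of that arithmetic (simplified; the real ExecutorAllocationManager
also caps requests by the number of pending tasks and by spark.dynamicAllocation.maxExecutors):
with the request size doubling each interval, the additions go 1, 2, 4, ..., 2^k, so after k
waves the cumulative request is 2^(k+1) - 1 and the final wave alone contributes 2^k, i.e.
roughly half of everything asked for. Reaching ~32,000 executors takes only about 15 one-second
waves, with ~16,000 of them requested in the last second.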

This can easily overwhelm RPC processing, or cause expensive executor startup steps to break
other systems. With the interval so short, many more containers may be requested than are
actually needed; they then complete very little work before sitting idle waiting to be
deallocated.

Lengthening the interval between these fixed doublings has only limited impact. Setting the
doubling interval to once per minute would result in very slow ramp-up, at the end of which we
would still face large, potentially crippling waves of executor startup.
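
For example (again ignoring the pending-task cap), with the doubling interval stretched to one
minute, reaching ~32,000 executors would take roughly 15 minutes, and the final one-minute wave
would still add ~16,000 executors at once; the large final wave is merely deferred, not avoided.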

An alternative approach to spooling up large jobs appears to be needed, one that is still
relatively simple but more adaptable to different cluster sizes and to differing cluster and
job performance.

I would like to propose a few different approaches built around the general idea of bounding
outstanding requests for new containers by the number of executors that are currently running,
for some definition of "running".

One example might be to limit requests to one new executor for every existing executor that
currently has an active task, or some ratio of that, to allow for more or less aggressive
spool-up. A lower ratio would let us approximate something like Fibonacci ramp-up; a higher
ratio of, say, 2x would spool up quickly while still staying aligned with the rate at which
broadcast blocks can be distributed to new members.
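
A minimal sketch of this ratio-based idea follows; LoadAwareRampUp, rampUpRatio and the counter
names are illustrative assumptions, not existing Spark classes or configuration keys.

// Hypothetical helper, not existing Spark code. Assumes the job bootstraps from
// spark.dynamicAllocation.initialExecutors / minExecutors, so there is a nonzero
// base of executors running tasks to grow from.
class LoadAwareRampUp(rampUpRatio: Double = 1.0) {

  /** Additional executors we are willing to request, given current load. */
  def maxNewRequests(executorsWithActiveTasks: Int, outstandingRequests: Int): Int = {
    // Permit up to `rampUpRatio` new executors per executor already doing useful work,
    // less whatever has already been requested but not yet granted.
    val allowedByLoad = math.ceil(executorsWithActiveTasks * rampUpRatio).toInt
    math.max(allowedByLoad - outstandingRequests, 0)
  }
}

With rampUpRatio = 1.0 and 100 executors currently running tasks, at most 100 additional
executors would be outstanding at any time; a ratio of 2.0 would roughly double each wave,
while a ratio below 1.0 would give the slower, more conservative growth described above.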

An alternative approach might be to limit the rate of new executor requests based on the number
of outstanding executors that have been requested but have not yet completed startup and are
not yet available for tasks. To protect against a suboptimal very early ramp, a minimum
concurrent executor startup threshold could allow an initial burst of, say, 10 executors,
after which the more gradual ramp math would apply.
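
A sketch of this second idea, under one possible reading of the proposal; StartupGatedRampUp
and its parameters are hypothetical, not existing Spark code.

// Hypothetical helper: gate new requests on how many executors are still "in flight"
// (requested but not yet registered and able to run tasks), with a floor so the
// very first wave can start from zero running executors.
class StartupGatedRampUp(startupRatio: Double = 1.0, minBurst: Int = 10) {

  def nextRequestSize(startedExecutors: Int, inFlightExecutors: Int, desiredAdditional: Int): Int = {
    // Allow at most `startupRatio` in-flight executors per fully started executor,
    // but never fewer than `minBurst`, so the initial burst of ~10 can proceed.
    val allowedInFlight = math.max(math.ceil(startedExecutors * startupRatio).toInt, minBurst)
    val openSlots = math.max(allowedInFlight - inFlightExecutors, 0)
    math.min(openSlots, desiredAdditional)
  }
}

With the defaults, a job starting from zero would have at most 10 executors launching at once;
as executors register and begin running tasks, the in-flight allowance grows with them, giving
a ramp that tracks how quickly the cluster can actually absorb new executors.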

> Dynamic allocation load-aware ramp up
> -------------------------------------
>
>                 Key: SPARK-25889
>                 URL: https://issues.apache.org/jira/browse/SPARK-25889
>             Project: Spark
>          Issue Type: New Feature
>          Components: Scheduler, YARN
>    Affects Versions: 2.3.2
>            Reporter: Adam Kennedy
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

