spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Xuefu Zhang <xu...@apache.org>
Subject Re: feedback about SPARK-22683 needed
Date Mon, 11 Dec 2017 17:57:19 GMT
At Uber we have observed the same resource efficiency issue with dynamic
allocation. Our workload is migrated from Hive on MR to Hive on Spark. We
saw significant performance improvement (>2X) with our workload. We also
expected big resource savings from this migration because there will be one
single Spark job to replace what would be many MR jobs. While there are
cases where Spark improves efficiency compared to MR, in many cases the
resource efficiency is significantly lower in Spark than in MR. This is
especially a great surprise to Hive on Spark because it essentially runs
the same Hive code and mainly uses Spark's shuffle. Resource consumption by
Spark would be a big issue for efficiency aware users.

We also analyzed that the resource consumption problem is tightly related
to the way dynamic allocation works. (Static allocation could be worse in
terms of efficiency.) The inefficiency probably comes in two aspects:

1. there is a big overhead of deallocating executors when they are no
longer needed.  By default, an executor will die out after 60s idle time.
(While one can argue that this can be tuned, but our experience showed that
getting lower than 60s will trigger out allocation problems.) When
thousands of executors are allocated and deallocated, idling out executors
wastes a lot mem-secs and core-secs in addition to allocating them in the
first place. Moreover, if in any stage there is a strangling task, all
other executors will be idling out.

2. Spark's executor tends to be bigger, usually with multiple cores, so it
has bigger inertia. Even if there is only one task to run, you still need
to allocate a big executor which could run multiple tasks.

Having said that, I'm not sure if the proposal in SPARK-22683  is the
solution. Nevertheless, we need to step back and have a general agreement
on whether there is resource efficiency problem in the dynamic allocation
(or Spark in general). From the JIRA discussions I sensed that there might
folks believing that the problem is only a matter of tuning. With Julien's
accounting and our experience, the problem is real and big.

Once we agree that there is problem, finding the right solution is the next
question. To me, the problem goes beyond just a matter of tuning.
(SPARK-22683  seems mitigating the problem also by fine tuning.) We need to
think of how Spark allocates resources and executes tasks. Besides of
dynamic allocation, we may consider bringing in a resource allocation mode
that's similar to the old, MR style allocation. That is, allocating
(smaller) executors only when needed and killing them right after them
executing tasks.

Thanks,
Xuefu


On Mon, Dec 11, 2017 at 7:44 AM, Julien Cuquemelle <j.cuquemelle@criteo.com>
wrote:

> Hi everyone,
>
>
>
> I'm currently porting a MapReduce Application to Spark (on a YARN
> cluster), and I'd like to have your insight regarding to the tuning of
> numbers of executors.
>
>
>
> This application is in fact a template that users can use to launch a
> variety of jobs which range from tens to thousands  of splits in the data
> partition and have a typical wall clock time of 400 to 9000 seconds; these
> jobs are experiments that are usually performed once or a few times, thus
> are not  easily tunable for resource consumption.
>
>
>
> As such, as it is the case with the MR jobs, I'd like the users not to
> have to specify a number of executors themselves, which is why I explored
>  the possibilities of the Dynamic Allocation in Spark.
>
>
>
> The current implementation of the dynamic allocation targets a total
> number of executors so that each core per executor executes 1 task; This
> seems to be in contradiction with spark guidelines that tasks should remain
> small and that each executor-core should process several tasks, and
> actually this gives the best overall latency but at the cost of a huge
> resource waste because some executors are not fully used, or even not used
> at all after having been allocated : in a representative set of experiments
> from our users, performed in an idle queue as well as in a busy queue, in
> average the latency in spark is decreased by 43%, but at the cost of an
> increase of 114% in the Vcore-hours usage w.r.t. the legacy MR job.
>
>
>
> Up to now I can't migrate these jobs to spark because of the doubling of
> resource usage.
>
>
>
> I did a proposal to allow tuning the target number of tasks that each
> executor-core (aka taskSlot) should process, which gives a way to tune the
> tradeoff between latency and vCore-Hours consumption:
> https://issues.apache.org/jira/browse/SPARK-22683
>
>
>
> As detailed in the proposal, I've been able to reach a 37% reduction in
> latency at iso-consumption (2 tasks per taskSlot), or a 30% reduction in
> resource usage at iso-latency (6 tasks per taskSlot), or a sweet spot at
> 20% reduction in resource consumption and 28% reduction in latency at 3
> tasks per slots. These figures are still averages over a representative set
> of jobs our users currently launch, and are to be compared with the
> doubling of resources usage of the current spark dynamic allocation
> behavior wrt MR.
>
> As mentioned by Sean Owen in our discussion of the proposal, we currently
> have 2 options allowing to tune the behavior of the dynamic allocation of
> executors, maxExecutors and schedulerBacklogTimeout, but these parameters
> would need to be tuned on a per-job basis, which is not compatible with the
> one-shot nature of the jobs I'm talking about.
>
>
>
> I’ve still tried to tune one series of jobs with the
> schedulerBacklogTimeout, and managed to get a similar vcore-hours
> consumption at the expense of 20% added in latency:
>
> - the resulting value of schedulerBacklogTimeout is only valid for other
> jobs that have a similar wall clock time, so will not be transposable to
> all the jobs our users launch;
>
> - even with a manually-tuned job, I don't get the same efficiency as a
> more global default I can set using my proposal.
>
>
>
> Thanks for any feedback regarding the pros and cons of adding a 3rd
> parameter to the dynamic allocation allowing to optimize the latency /
> consumption tradeoff over a family of jobs, or any proposal to achieve
> reducing resource usage without per job tuning with the existing Dynamic
> Allocation policy
>
>
>
> Julien
>

Mime
View raw message