spark-user mailing list archives

From Ranju Jain <Ranju.J...@ericsson.com.INVALID>
Subject RE: Dynamic Allocation Backlog Property in Spark on Kubernetes
Date Fri, 09 Apr 2021 04:03:21 GMT
Hi Attila,

Thanks for your reply.

Consider a single job which starts to run with minExecutors as 3. And suppose this job
[which reads the full data from the backend, processes it, and writes it to a location]
takes around 2 hours to complete.

What I understood is: as the default value of spark.dynamicAllocation.schedulerBacklogTimeout
is 1 sec, executors will scale from 3 to 4 and then up to 8, with a new request every second
while tasks are pending at the scheduler backend. So if I don't want it to be 1 sec, I might
set it to 1 hour [3600 sec] for a 2-hour Spark job.

So this is all about when I want to scale executors dynamically within a Spark job. Is that
understanding correct?

In the statement below I don't understand much about available partitions :-(
"pending tasks (which is roughly related to the available partitions)"


Regards
Ranju


From: Attila Zsolt Piros <piros.attila.zsolt@gmail.com>
Sent: Friday, April 9, 2021 12:13 AM
To: Ranju Jain <Ranju.Jain@ericsson.com.invalid>
Cc: user@spark.apache.org
Subject: Re: Dynamic Allocation Backlog Property in Spark on Kubernetes

Hi!

For dynamic allocation you do not need to run the Spark jobs in parallel.
Dynamic allocation simply means Spark scales up by requesting more executors when there are
pending tasks (which is roughly related to the available partitions) and scales down when an
executor is idle (as even within one job the number of partitions can fluctuate).
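As a sketch of what this looks like on Kubernetes: since there is no external shuffle service there, dynamic allocation is typically enabled together with shuffle tracking. The master URL below is a placeholder; the property names are the documented Spark configs.

```shell
# Sketch: enabling dynamic allocation for Spark on Kubernetes.
# shuffleTracking is needed on K8s because there is no external shuffle service.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=3 \
  --conf spark.dynamicAllocation.maxExecutors=8 \
  ...
```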

But if you are optimizing for run time, you can start those jobs in parallel at the beginning.
In that case you will use a higher number of executors right from the start.

The "spark.dynamicAllocation.schedulerBacklogTimeout" property is not for scheduling/synchronizing
different Spark jobs; it is about tasks.

Best regards,
Attila

On Tue, Apr 6, 2021 at 1:59 PM Ranju Jain <Ranju.Jain@ericsson.com.invalid> wrote:
Hi All,

I have dynamic allocation enabled while running Spark on Kubernetes. New executors
are requested only if tasks have been pending for longer than the duration configured in the
property "spark.dynamicAllocation.schedulerBacklogTimeout".

My Use Case is:

There are a number of parallel jobs which may or may not run together at a particular point
in time. E.g. only one Spark job may run at a point in time, or two Spark jobs may run at a
single point in time, depending upon the need.
I configured spark.dynamicAllocation.minExecutors as 3 and spark.dynamicAllocation.maxExecutors
as 8.
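For reference, the backlog-related properties involved (with their documented defaults) look like the fragment below; the values shown are the defaults, for illustration only, not a recommendation:

```shell
# Defaults: schedulerBacklogTimeout = 1s (first request after this delay);
# sustainedSchedulerBacklogTimeout defaults to schedulerBacklogTimeout and
# governs subsequent requests; executorIdleTimeout controls scale-down.
--conf spark.dynamicAllocation.schedulerBacklogTimeout=1s \
--conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=1s \
--conf spark.dynamicAllocation.executorIdleTimeout=60s
```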

Steps:

  1.  SparkContext is initialized with 3 executors and the first job is submitted.
  2.  Now, if a second job is submitted after a few minutes (e.g. 15 mins), I am thinking I can
use the benefit of dynamic allocation, and executors should scale up to handle the second job's tasks.

For this I think "spark.dynamicAllocation.schedulerBacklogTimeout" needs to be set, after
which new executors would be requested.

Problem: There is a chance that the second job is not requested at all, or is requested
after 10 mins or after 20 mins. How can I set a constant value for the

property "spark.dynamicAllocation.schedulerBacklogTimeout" to scale the executors, when
the task backlog depends on the number of jobs requested?


Regards
Ranju