spark-user mailing list archives

From Aakash Basu <aakash.spark....@gmail.com>
Subject Re: Inferring Data driven Spark parameters
Date Wed, 04 Jul 2018 07:04:49 GMT
I do not want to change executor/driver cores/memory on the fly within a single
Spark job; all I want is to make them cluster specific. So, I want a formula
with which, given the driver and worker specifications of the cluster, I can
derive those values before passing them to spark-submit.
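
As a rough illustration of what I mean (nothing authoritative), a minimal Python
helper like the one below is the idea; the name suggest_cluster_params and the
rules of thumb baked into it (reserve one core and ~1 GB per worker for the OS,
cap executors at 5 cores, leave ~10% of memory for overhead) are my own
assumptions, not anything Spark prescribes:

def suggest_cluster_params(num_workers, cores_per_worker, mem_per_worker_gb,
                           max_cores_per_executor=5):
    # Reserve one core and ~1 GB per worker for the OS / node daemons
    usable_cores = cores_per_worker - 1
    usable_mem_gb = mem_per_worker_gb - 1

    executor_cores = min(max_cores_per_executor, usable_cores)
    executors_per_worker = max(1, usable_cores // executor_cores)
    num_executors = num_workers * executors_per_worker

    # Keep roughly 10% aside for spark.executor.memoryOverhead
    executor_memory_gb = int((usable_mem_gb / executors_per_worker) * 0.9)

    return {
        "--num-executors": num_executors,
        "--executor-cores": executor_cores,
        "--executor-memory": "%dG" % executor_memory_gb,
    }

# With the cluster below (4 workers, 8 cores / 16 GB each) this suggests
# --num-executors 4 --executor-cores 5 --executor-memory 13G
print(suggest_cluster_params(4, 8, 16))

The driver cores/memory could be sized the same way from the driver node's specs.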

I more or less know how to achieve that part, as I've done it before.

What I really need is to tweak the other Spark confs depending on the data. Is
that possible? Just as an example: if I have 100+ features, I want to double the
default spark.driver.maxResultSize to 2G, and similarly adjust the other
configs. Can that be achieved by any means, to get an optimal run on that kind
of dataset? If yes, how? Something along the lines of the sketch below is what
I am after.
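
Purely as a sketch of the idea, run before building the spark-submit command
(the file name train.csv, the helper name suggest_data_confs and the thresholds
are placeholders I made up, not established rules):

import csv

def suggest_data_confs(csv_path, approx_size_gb):
    # Peek at the header row to count features without reading the whole file
    with open(csv_path) as f:
        num_cols = len(next(csv.reader(f)))

    confs = {}
    if num_cols >= 100:
        # Wide dataset: double the 1g default for spark.driver.maxResultSize
        confs["spark.driver.maxResultSize"] = "2G"
    # Aim for roughly 128 MB per shuffle partition, but never fewer than 100
    confs["spark.sql.shuffle.partitions"] = str(max(100, int(approx_size_gb * 1024 // 128)))
    return ["--conf %s=%s" % (k, v) for k, v in confs.items()]

# e.g. a ~25 GB input would come back with spark.sql.shuffle.partitions=200
print(" ".join(suggest_data_confs("train.csv", approx_size_gb=25)))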

On Tue, Jul 3, 2018 at 6:28 PM, Vadim Semenov <vadim@datadoghq.com> wrote:

> You can't change the executor/driver cores/memory on the fly once
> you've already started a Spark Context.
> On Tue, Jul 3, 2018 at 4:30 AM Aakash Basu <aakash.spark.raj@gmail.com>
> wrote:
> >
> > We aren't using Oozie or anything similar. Moreover, the end-to-end job will
> be exactly the same, but the data will be extremely different (number of
> continuous and categorical columns, row count, column count, etc.). Hence, if
> there were a way to calculate the parameters, so that we could simply look at
> the data and derive the respective configuration/parameters, that would be
> great.
> >
> > On Tue, Jul 3, 2018 at 1:09 PM, Jörn Franke <jornfranke@gmail.com>
> wrote:
> >>
> >> Don’t do this in your job. Create different jobs for the different types of
> workload and orchestrate them using Oozie or similar.
> >>
> >> On 3. Jul 2018, at 09:34, Aakash Basu <aakash.spark.raj@gmail.com>
> wrote:
> >>
> >> Hi,
> >>
> >> Cluster - 5 node (1 Driver and 4 workers)
> >> Driver Config: 16 cores, 32 GB RAM
> >> Worker Config: 8 cores, 16 GB RAM
> >>
> >> I'm using the parameters below; I know that the first chunk is
> cluster dependent and the second chunk is data/code dependent.
> >>
> >> --num-executors 4
> >> --executor-cores 5
> >> --executor-memory 10G
> >> --driver-cores 5
> >> --driver-memory 25G
> >>
> >>
> >> --conf spark.sql.shuffle.partitions=100
> >> --conf spark.driver.maxResultSize=2G
> >> --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC"
> >> --conf spark.scheduler.listenerbus.eventqueue.capacity=20000
> >>
> >> I arrived at these values through my own R&D on the properties and on the
> issues I faced, which is how I found these handles.
> >>
> >> My ask here is -
> >>
> >> 1) How can I infer, using some formula or code, the values for the second
> chunk above, i.e. the one that depends on the data/code?
> >> 2) What other usable properties/configurations can I use
> to shorten my job runtime?
> >>
> >> Thanks,
> >> Aakash.
> >
> >
>
>
> --
> Sent from my iPhone
>
