spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Inferring Data driven Spark parameters
Date Wed, 04 Jul 2018 14:30:06 GMT
Hi Aakash,

For clarification, are you running this in YARN client mode or standalone?

How much total YARN memory is available?

From my experience with a bigger cluster, I found the following incremental
settings useful (CDH 5.9, YARN client), so you can scale yours accordingly:

[1] - 576GB

--num-executors 24
--executor-memory 21G
--executor-cores 4
--conf spark.yarn.executor.memoryOverhead=3000



[2] - 672GB

--num-executors 28
--executor-memory 21G
--executor-cores 4
--conf spark.yarn.executor.memoryOverhead=3000



[3] - 768GB

--num-executors 32
--executor-memory 21G
--executor-cores 4
--conf spark.yarn.executor.memoryOverhead=3000



[4] - 864GB

--num-executors 32
--executor-memory 21G
--executor-cores 4
--conf spark.yarn.executor.memoryOverhead=3000
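
One possible reading of the totals above, offered as an assumption rather than something this message states: each executor asks YARN for its heap (--executor-memory) plus the configured memoryOverhead, and YARN rounds that request up to its allocation increment. The sketch below assumes a 1024 MB increment (the real value depends on the scheduler configuration of the cluster), and the function names are hypothetical. Under those assumptions it reproduces the first two totals listed.

```python
import math

def yarn_container_mb(executor_memory_gb, overhead_mb, increment_mb=1024):
    """Memory YARN would grant one executor: heap plus overhead,
    rounded up to the scheduler increment (assumed to be 1024 MB)."""
    requested_mb = executor_memory_gb * 1024 + overhead_mb
    return math.ceil(requested_mb / increment_mb) * increment_mb

def total_yarn_gb(num_executors, executor_memory_gb=21, overhead_mb=3000):
    """Total YARN memory taken by all executors, in GB (1 GB = 1024 MB)."""
    return num_executors * yarn_container_mb(executor_memory_gb, overhead_mb) / 1024

if __name__ == "__main__":
    # Settings [1] and [2] from the message: 24 and 28 executors.
    for n in (24, 28):
        print(n, "executors ->", total_yarn_gb(n), "GB")
```

Read this way, scaling to a smaller cluster means dividing the YARN memory actually available by the per-executor container size, leaving room for the driver or application master.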



HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 3 Jul 2018 at 08:34, Aakash Basu <aakash.spark.raj@gmail.com> wrote:

> Hi,
>
> Cluster - 5 node (1 Driver and 4 workers)
> Driver Config: 16 cores, 32 GB RAM
> Worker Config: 8 cores, 16 GB RAM
>
> I'm using the below parameters from which I know the first chunk is
> cluster dependent and the second chunk is data/code dependent.
>
> --num-executors 4
> --executor-cores 5
> --executor-memory 10G
> --driver-cores 5
> --driver-memory 25G
>
>
> --conf spark.sql.shuffle.partitions=100
> --conf spark.driver.maxResultSize=2G
> --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC"
> --conf spark.scheduler.listenerbus.eventqueue.capacity=20000
>
> I arrived at these values through my own R&D on the properties and the
> issues I faced, hence these particular settings.
>
> My ask here is -
>
> *1) How can I use a formula or code to infer the second chunk of settings,
> which depends on the data/code?*
> *2) What other properties/configurations can I use to shorten my job
> runtime?*
>
> Thanks,
> Aakash.
>
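
As a sanity check on the cluster-dependent chunk quoted above, the following sketch estimates how many executors of the requested size fit on one worker. It assumes the whole of each worker (8 cores, 16 GB) is available to YARN, which is rarely true in practice, and it uses Spark's documented default overhead of max(384 MB, 10% of executor memory) because the quoted command does not set memoryOverhead. The function names are hypothetical.

```python
def default_overhead_mb(executor_memory_mb):
    """Spark's default YARN overhead: 10% of executor memory, at least 384 MB."""
    return max(384, executor_memory_mb // 10)

def executors_per_worker(worker_cores, worker_mem_mb, executor_cores, executor_mem_mb):
    """Executors that fit on one worker, limited by cores and by memory.
    Assumes all of the worker's cores and memory are given to YARN."""
    container_mb = executor_mem_mb + default_overhead_mb(executor_mem_mb)
    by_cores = worker_cores // executor_cores
    by_memory = worker_mem_mb // container_mb
    return min(by_cores, by_memory)

if __name__ == "__main__":
    # Quoted request: --executor-cores 5, --executor-memory 10G on 8-core, 16 GB workers.
    per_worker = executors_per_worker(8, 16 * 1024, 5, 10 * 1024)
    print("executors per worker:", per_worker)
    print("executors on 4 workers:", per_worker * 4)
```

With 5 cores per executor, only one executor fits on an 8-core worker under these assumptions, so 3 cores per worker stay unused by executors.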
