spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Inferring Data driven Spark parameters
Date Tue, 03 Jul 2018 07:39:39 GMT
Don’t do this inside your job. Instead, create separate jobs for the different job types and orchestrate them with Oozie or a similar scheduler.
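
For example, a minimal sketch of that approach: submit each job type with its own resource profile and let the orchestrator chain them. The job names and the exact numbers here are illustrative, not taken from the thread.

```shell
#!/bin/sh
# Hypothetical example: one spark-submit per job type, each with its own
# resource profile, chained by an orchestrator (Oozie, Airflow, cron, ...).

# Lightweight ETL job: few executors, modest memory.
spark-submit \
  --num-executors 2 \
  --executor-cores 2 \
  --executor-memory 4G \
  etl_job.py

# Heavy shuffle-bound job: more cores, more memory, more shuffle partitions.
spark-submit \
  --num-executors 4 \
  --executor-cores 5 \
  --executor-memory 10G \
  --conf spark.sql.shuffle.partitions=200 \
  heavy_job.py
```

This keeps each job's configuration static and reviewable, rather than trying to tune one job for all workloads at once.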

> On 3. Jul 2018, at 09:34, Aakash Basu <aakash.spark.raj@gmail.com> wrote:
> 
> Hi,
> 
> Cluster - 5 node (1 Driver and 4 workers)
> Driver Config: 16 cores, 32 GB RAM
> Worker Config: 8 cores, 16 GB RAM
> 
> I'm using the parameters below; I know the first chunk is cluster-dependent and the second chunk is data/code-dependent.
> 
> --num-executors 4 
> --executor-cores 5
> --executor-memory 10G 
> --driver-cores 5 
> --driver-memory 25G 
> 
> 
> --conf spark.sql.shuffle.partitions=100 
> --conf spark.driver.maxResultSize=2G 
> --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC" 
> --conf spark.scheduler.listenerbus.eventqueue.capacity=20000
> 
> I've arrived at these values through my own R&D on these properties and the issues I ran into along the way.
> 
> My ask here is -
> 
> 1) How can I infer, using some formula or code, the values for the data/code-dependent chunk?
> 2) What other properties/configurations can I use to shorten my job's runtime?
> 
> Thanks,
> Aakash.
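
For question 1, the cluster-dependent chunk at least can be derived from the node specs with a common rule of thumb: cap executors at ~5 cores each, reserve one core and 1 GB per worker for the OS and daemons, and leave roughly 7% of executor memory as overhead. This is only a sketch of that heuristic, not an official Spark formula, and the helper function is hypothetical:

```python
# Rough executor-sizing heuristic (a sketch, not a Spark API):
# - reserve 1 core and 1 GB per worker for OS/daemons
# - cap cores per executor at 5
# - leave one executor slot free for the application master / driver
# - deduct ~7% of executor memory as overhead

def size_executors(workers, cores_per_node, mem_per_node_gb,
                   cores_per_executor=5, overhead_frac=0.07):
    usable_cores = (cores_per_node - 1) * workers
    executors_total = usable_cores // cores_per_executor
    num_executors = executors_total - 1  # one slot for the AM/driver
    executors_per_node = max(1, executors_total // workers)
    mem_per_executor = (mem_per_node_gb - 1) / executors_per_node
    heap_gb = int(mem_per_executor * (1 - overhead_frac))
    return num_executors, cores_per_executor, heap_gb

# The 4-worker cluster from this thread: 8 cores / 16 GB RAM per node.
print(size_executors(workers=4, cores_per_node=8, mem_per_node_gb=16))
```

For this cluster the heuristic lands close to the values already in use (4 executors of 5 cores each), which suggests the cluster chunk is fine; the data-dependent chunk (e.g. spark.sql.shuffle.partitions) is better sized from observed shuffle volumes than from any static formula.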
