spark-user mailing list archives

From CPC <acha...@gmail.com>
Subject Re: Spark Optimization
Date Thu, 26 Apr 2018 18:13:39 GMT
I would recommend UseParallelGC since this is a batch job. Parallelism
should be 2-3x the number of cores. Also, if those are physical machines, I
would recommend an MTU of 9000 on the network. Is that 128 GB per node, or
64 GB per node?
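
As a minimal sketch, the GC switch and the parallelism could be applied
like this when building the session (the app name is a placeholder, and 56
assumes roughly 2x your 28 total cores):

    import org.apache.spark.sql.SparkSession

    // Parallel GC favors batch throughput over G1's shorter pauses.
    // Executor JVM options must be in the config before executors launch,
    // so set them at session creation, in spark-defaults.conf, or via
    // spark-submit --conf.
    val spark = SparkSession.builder()
      .appName("aggregation-poc") // placeholder name
      .config("spark.executor.extraJavaOptions", "-XX:+UseParallelGC")
      .config("spark.default.parallelism", "56")
      .config("spark.sql.shuffle.partitions", "56")
      .getOrCreate()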

On Thu, Apr 26, 2018, 7:40 PM vincent gromakowski <
vincent.gromakowski@gmail.com> wrote:

> Ideal parallelism is 2-3x the number of cores, but it depends on the
> number of partitions of your source and the operations you use (shuffle or
> not). It can be worth paying the extra cost of an initial repartition to
> match your cluster, as in the sketch below, but it clearly depends on your
> DAG.
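> A minimal sketch of such an initial repartition, assuming an existing
> DataFrame df with placeholder columns "key" and "value":
>
>     import org.apache.spark.sql.functions.sum
>
>     // Repartition to ~2x the 28 total cores before the shuffle-heavy
>     // aggregation; the right count depends on your cluster and DAG.
>     val result = df.repartition(56)
>       .groupBy("key")
>       .agg(sum("value"))
>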
> Optimizing Spark apps depends on many things, so it's hard to answer:
> - cluster size
> - scheduler
> - spark version
> - transformation graph (DAG)
> ...
>
> On Thu, Apr 26, 2018 at 17:49, Pallavi Singh <pallavi_singh@persistent.com>
> wrote:
>
>> Hi Team,
>>
>>
>>
>> We are currently working on a POC based on Spark and Scala.
>>
>> We have to read 18 million records from a Parquet file and perform 25
>> user-defined aggregations based on grouping keys.
>>
>> We have used Spark's high-level DataFrame API for the aggregations. On a
>> cluster of two nodes we could finish the end-to-end job
>> (read + aggregation + write) in 2 minutes.
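>>
>> The shape of the job, as a minimal sketch (the paths, grouping keys, and
>> the two built-in aggregates below are placeholders standing in for our 25
>> user-defined aggregations):
>>
>>     import org.apache.spark.sql.SparkSession
>>     import org.apache.spark.sql.functions.{avg, sum}
>>
>>     val spark = SparkSession.builder().appName("poc").getOrCreate()
>>
>>     // Read the 18 million input records (path is a placeholder).
>>     val df = spark.read.parquet("/path/to/input")
>>
>>     // Group on the keys and apply the aggregations (placeholders here).
>>     val result = df.groupBy("key1", "key2")
>>       .agg(sum("metric1"), avg("metric2"))
>>
>>     result.write.parquet("/path/to/output")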
>>
>>
>>
>> *Cluster Information:*
>>
>> Number of Nodes: 2
>>
>> Total Cores: 28
>>
>> Total RAM: 128 GB
>>
>>
>>
>> *Component:*
>>
>> Spark Core
>>
>>
>>
>> *Scenario:*
>>
>> How-to
>>
>>
>>
>> *Tuning Parameter:*
>>
>> spark.serializer org.apache.spark.serializer.KryoSerializer
>>
>> spark.default.parallelism 24
>>
>> spark.sql.shuffle.partitions 24
>>
>> spark.executor.extraJavaOptions -XX:+UseG1GC
>>
>> spark.speculation true
>>
>> spark.executor.memory 16G
>>
>> spark.driver.memory 8G
>>
>> spark.sql.codegen true
>>
>> spark.sql.inMemoryColumnarStorage.batchSize 100000
>>
>> spark.locality.wait 1s
>>
>> spark.ui.showConsoleProgress false
>>
>> spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec
>>
>> Please let us know if you have any ideas or tuning parameters that we
>> could use to finish the job in less than one minute.
>>
>>
>>
>>
>>
>> Regards,
>>
>> Pallavi
>
