spark-user mailing list archives

From Pallavi Singh
Subject Spark Optimization
Date Thu, 26 Apr 2018 15:49:09 GMT
Hi Team,

We are currently working on a POC based on Spark and Scala.
We have to read 18 million records from a Parquet file and perform 25 user-defined aggregations
based on grouping keys.
We used Spark's high-level DataFrame API for the aggregation. On a two-node cluster, the
end-to-end job (read + aggregation + write) finishes in 2 min.
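
The shape of the job described above can be sketched roughly as follows. This is only an illustration, not the actual job: the paths, the column names (key1, key2, amount) and the three built-in aggregates standing in for the 25 user-defined aggregations are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, max, sum}

object AggregationSketch {
  // Group by the (hypothetical) key columns and compute several aggregates
  // in a single agg() call, so they are all evaluated in one grouped pass.
  def aggregate(df: DataFrame): DataFrame =
    df.groupBy(col("key1"), col("key2"))
      .agg(
        sum(col("amount")).as("total_amount"),
        avg(col("amount")).as("avg_amount"),
        max(col("amount")).as("max_amount")
      )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AggregationSketch").getOrCreate()

    // Read: hypothetical input path.
    val input = spark.read.parquet("/path/to/input.parquet")

    // Aggregate and write: hypothetical output path.
    aggregate(input).write.mode("overwrite").parquet("/path/to/output")

    spark.stop()
  }
}
```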

Cluster information:
Number of nodes: 2
Total cores: 28
Total RAM: 128 GB

Spark Core


Tuning parameters:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.default.parallelism 24
spark.sql.shuffle.partitions 24
spark.executor.extraJavaOptions -XX:+UseG1GC
spark.speculation true
spark.executor.memory 16G
spark.driver.memory 8G
spark.sql.codegen true
spark.sql.inMemoryColumnarStorage.batchSize 100000
spark.locality.wait 1s
spark.ui.showConsoleProgress false
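
For illustration, a subset of the settings listed above could also be applied when the session is built. This is a sketch under the assumption that the job creates its own SparkSession; the application name is hypothetical. spark.driver.memory is left out on purpose, because in client mode the driver JVM has already started by the time this code runs, so that value has to come from spark-submit or the defaults file.

```scala
import org.apache.spark.sql.SparkSession

object SessionConfigSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AggregationPoc") // hypothetical name
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.default.parallelism", "24")
      .config("spark.sql.shuffle.partitions", "24")
      .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
      .config("spark.speculation", "true")
      .config("spark.executor.memory", "16g")
      .config("spark.locality.wait", "1s")
      .getOrCreate()

    // Print the effective shuffle partition count to confirm the setting took effect.
    println(spark.conf.get("spark.sql.shuffle.partitions"))

    spark.stop()
  }
}
```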
Please let us know if you have any ideas or tuning parameters that we can use to finish the job
in less than one minute.

