spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Romi Kuntsman <>
Subject How to determine the value for spark.sql.shuffle.partitions?
Date Tue, 01 Sep 2015 09:17:49 GMT
Hi all,

The number of partition greatly affect the speed and efficiency of
calculation, in my case in DataFrames/SparkSQL on Spark 1.4.0.

Too few partitions with large data cause OOM exceptions.
Too many partitions on small data cause a delay due to overhead.

How do you programmatically determine the optimal number of partitions and
cores in Spark, as a function of:

   1. available memory per core
   2. number of records in input data
   3. average/maximum record size
   4. cache configuration
   5. shuffle configuration
   6. serialization
   7. etc?

Any general best practices?


Romi K.

View raw message