spark-dev mailing list archives

From Tom Hubregtsen <>
Subject Spilling when not expected
Date Thu, 12 Mar 2015 23:09:11 GMT
Hi all,

I'm running the teraSort benchmark with a relatively small input set: 5GB.
During profiling, I can see I am using a total of 68GB. I've got a terabyte
of memory in my system, and set
spark.executor.memory 900g
spark.driver.memory 900g
I use the default for 
I believe that I now have 0.2*900=180GB for shuffle and 0.6*900=540GB for
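
The arithmetic above can be written out as a small sketch. It only reproduces the numbers quoted in this message (900g executor memory, fractions of 0.2 and 0.6, 68GB observed); the object and value names are hypothetical, and it makes no claim about how Spark computes its budgets internally.

```scala
// Back-of-the-envelope check of the budgets quoted in the message.
// All names here are illustrative; nothing below calls Spark.
object MemoryBudget {
  val executorMemoryGb = 900   // spark.executor.memory 900g
  val shufflePercent = 20      // the 0.2 fraction from the message
  val secondPercent = 60       // the 0.6 fraction from the message
  val observedUsageGb = 68     // total usage seen while profiling

  // Integer arithmetic keeps the results exact.
  val shuffleGb: Int = executorMemoryGb * shufflePercent / 100
  val secondGb: Int = executorMemoryGb * secondPercent / 100

  def main(args: Array[String]): Unit = {
    println(s"0.2 * 900 = $shuffleGb GB")
    println(s"0.6 * 900 = $secondGb GB")
    println(s"shuffle budget exceeds observed usage: ${shuffleGb > observedUsageGb}")
  }
}
```

Under these assumptions the shuffle budget alone is larger than the total usage observed, which is why the spilling is unexpected.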

I noticed a lot of variation in runtime (under the same load), and tracked
this down to this function in 
  private def spillToPartitionFiles(collection: SizeTrackingPairCollection[(Int, K), C]): Unit = {
In a slow run, it would loop through this function 12000 times, in a fast
run only 700 times, even though the settings in both runs are the same and
there are no other users on the system. When I look at the function calling
this (insertAll, also in ExternalSorter), I see that spillToPartitionFiles
is only called 700 times in both fast and slow runs, meaning that the
function recursively calls itself very often. Because of the function name,
I assume the system is spilling to disk. As I have sufficient memory, I
assume that I forgot to set a certain memory setting. Does anybody have an idea which
other setting I have to set in order to avoid spilling data in this scenario?
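
One way to check the assumption that the function re-enters itself is to count all entries separately from entries made while no other invocation is active. The following is a minimal sketch of that idea on a stand-in function; `CallCounter`, `tracked`, and `work` are hypothetical names, this is not Spark code, and the counters are not thread-safe.

```scala
// Hypothetical instrumentation: distinguishes calls made from outside a
// function from entries made while an earlier invocation is still running.
object CallCounter {
  var totalEntries = 0    // every time the tracked body runs
  var topLevelCalls = 0   // entries made while no other invocation is active
  private var depth = 0   // current nesting level

  def tracked[A](body: => A): A = {
    totalEntries += 1
    if (depth == 0) topLevelCalls += 1
    depth += 1
    try body finally depth -= 1
  }
}

object Demo {
  // Stand-in for a function that may call itself.
  def work(n: Int): Unit = CallCounter.tracked {
    if (n > 0) work(n - 1)
  }
}
```

If the two counters stay equal after a run, the extra entries do not come from recursion and the difference has to be explained some other way, for example by how the profiler counts.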


