spark-user mailing list archives

From "Shuai Zheng" <szheng.c...@gmail.com>
Subject RE: Why always spilling to disk and how to improve it?
Date Wed, 14 Jan 2015 20:10:39 GMT
Thanks a lot!

 

I just realized that Spark is not really an in-memory version of MapReduce :)

 

From: Akhil Das [mailto:akhil@sigmoidanalytics.com] 
Sent: Tuesday, January 13, 2015 3:53 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: Re: Why always spilling to disk and how to improve it?

 

You could try setting the following to tweak the application a little bit:

 

      .set("spark.rdd.compress","true")

      .set("spark.storage.memoryFraction", "1")

      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

 

For shuffle behavior, you can look at this document https://spark.apache.org/docs/1.1.0/configuration.html#shuffle-behavior




Thanks

Best Regards

 

On Wed, Jan 14, 2015 at 1:51 AM, Shuai Zheng <szheng.code@gmail.com> wrote:

Hi All,

 

I am trying this with a small data set. It is only about 200 MB, and all I am doing is a distinct count on it.

But a lot of spilling shows up in the log (attached at the end of this email).

 

Basically I use 10 GB of memory, running on a one-node EMR cluster with an r3.8xlarge instance type (which has 244 GB of memory and 32 vCPUs).

 

My code is simple and is run in spark-shell (~/spark/bin/spark-shell --executor-cores 4 --executor-memory 10G):

 

val llg = sc.textFile("s3://…/part-r-00000") // File is around 210.5M, 4.7M rows inside

//val llg = sc.parallelize(List("-240990|161327,9051480,0,2,30.48,75", "-240990|161324,9051480,0,2,30.48,75"))

val ids = llg.flatMap(line => line.split(",").slice(0,1)) // Try to get the first column as the key

val counts = ids.distinct.count
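
Equivalently, the first column can be taken with a plain map instead of flatMap + slice; a quick sketch, assuming every line is non-empty so split always returns at least one field:

val ids = llg.map(line => line.split(",")(0)) // first field of each row as the key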

 

I think I should have enough memory, so there should not be any spilling. Can anyone give me some idea why this happens, or where I can tune the system to reduce the spilling? (It is not an issue on this dataset, but I want to see how to tune it.)

The Spark UI shows only 24.2 MB of shuffle write. If I have 10 GB of memory for the executor, why does it need to spill?

 

2015-01-13 20:01:53,010 INFO  [sparkDriver-akka.actor.default-dispatcher-2] storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Updated info of block broadcast_2_piece0
2015-01-13 20:01:53,011 INFO  [Spark Context Cleaner] spark.ContextCleaner (Logging.scala:logInfo(59)) - Cleaned broadcast 2
2015-01-13 20:01:53,399 INFO  [Executor task launch worker-5] collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 149 spilling in-memory map of 23.4 MB to disk (3 times so far)
2015-01-13 20:01:53,516 INFO  [Executor task launch worker-7] collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 151 spilling in-memory map of 23.4 MB to disk (3 times so far)
2015-01-13 20:01:53,531 INFO  [Executor task launch worker-6] collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 150 spilling in-memory map of 23.2 MB to disk (3 times so far)
2015-01-13 20:01:53,793 INFO  [Executor task launch worker-4] collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 148 spilling in-memory map of 23.4 MB to disk (3 times so far)
2015-01-13 20:01:54,460 INFO  [Executor task launch worker-5] collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 149 spilling in-memory map of 23.2 MB to disk (4 times so far)
2015-01-13 20:01:54,469 INFO  [Executor task launch worker-7] collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 151 spilling in-memory map of 23.2 MB to disk (4 times so far)
2015-01-13 20:01:55,144 INFO  [Executor task launch worker-6] collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 150 spilling in-memory map of 24.2 MB to disk (4 times so far)
2015-01-13 20:01:55,192 INFO  [Executor task launch worker-4] collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 148 spilling in-memory map of 23.2 MB to disk (4 times so far)

 

I am trying to collect more benchmarks before moving on to a bigger dataset and more complex logic in the next step.

 

Regards,

 

Shuai

 

