spark-user mailing list archives

From Burak Yavuz <bya...@stanford.edu>
Subject Re: Spark and disk usage.
Date Wed, 17 Sep 2014 14:44:51 GMT
Hi,

The files you mentioned are temporary files written by Spark during shuffling. ALS will write
a LOT of those files, as it is a shuffle-heavy algorithm.
Those files are deleted only after your program completes, because Spark looks for them in
case a fault occurs. Having those files available allows Spark to
resume from the stage where the shuffle left off, instead of starting from the very beginning.

Long story short, it's to your benefit that Spark writes those files to disk. If you don't
want Spark writing them to local disk, you can specify a checkpoint directory in
HDFS; Spark will write the current state there instead and clean up the shuffle files from disk.
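
For example, a minimal sketch in Scala, assuming an existing SparkContext `sc`
(the HDFS path and the RDD here are made up for illustration):

// Hypothetical HDFS path; point it at a directory Spark can write to.
sc.setCheckpointDir("hdfs://namenode:8020/tmp/spark-checkpoints")

// A shuffle-heavy step, standing in for something like ALS.
val shuffled = sc.parallelize(1 to 1000000).map(x => (x % 100, x)).groupByKey()

// Mark the RDD for checkpointing before the first action on it; once the
// checkpoint is materialized, Spark can truncate the lineage instead of
// retaining the shuffle files for recovery.
shuffled.checkpoint()
shuffled.count()  // the action forces computation and writes the checkpoint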

Best,
Burak

----- Original Message -----
From: "Макар Красноперов" <connector.acm@gmail.com>
To: user@spark.apache.org
Sent: Wednesday, September 17, 2014 7:37:49 AM
Subject: Spark and disk usage.

Hello everyone.

The problem is that Spark writes data to disk very heavily, even when the
application has plenty of free memory (about 3.8 GB).
I've noticed that a folder with a name like
"spark-local-20140917165839-f58c" contains a lot of other folders with
files like "shuffle_446_0_1". The total size of the files in the dir
"spark-local-20140917165839-f58c" can reach 1.1 GB.
Sometimes its size decreases (are there only temp files in that folder?),
so the total amount of data written to disk is greater than 1.1 GB.

The question is: what kind of data does Spark store there, and can I make Spark
keep it in memory rather than writing it to disk when there is
enough free RAM?

I run my job locally with Spark 1.0.1:
./bin/spark-submit --driver-memory 12g --master local[3] --properties-file
conf/spark-defaults.conf --class my.company.Main /path/to/jar/myJob.jar

spark-defaults.conf :
# keep in-memory shuffle data from spilling to disk
spark.shuffle.spill             false
# max MB of map output fetched simultaneously by each reduce task
spark.reducer.maxMbInFlight     1024
# per-file in-memory buffer size (KB) for shuffle writes
spark.shuffle.file.buffer.kb    2048
# fraction of the heap reserved for Spark's storage (cache)
spark.storage.memoryFraction    0.7

This disk usage pattern is common across many jobs. I have also used ALS
from MLlib and saw similar behavior.

I have had no success playing with the Spark configuration, and I hope
someone can help me :)



