spark-user mailing list archives

From Ufuk Celebi <u.cel...@fu-berlin.de>
Subject Re: Task output before a shuffle
Date Tue, 29 Oct 2013 12:13:56 GMT
On 29 Oct 2013, at 02:47, Matei Zaharia <matei.zaharia@gmail.com> wrote:
> Yes, we still write out data after these tasks in Spark 0.8, and it needs to be written
out before any stage that reads it can start. The main reason is simplicity when there are
faults, as well as more flexible scheduling (you don't have to decide where each reduce task
is in advance, you can have more reduce tasks than you have CPU cores, etc).

Thank you for the answer! I have a follow-up:

In which fraction of the heap (the RDD storage fraction or the rest) will the shuffle output be stored before spilling to disk?

I have a job where I read over a large data set once and don't persist anything. Would it make sense to set "spark.storage.memoryFraction" to a smaller value in order to avoid spilling to disk?
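For context, here is a minimal sketch of what I mean. In Spark 0.8 configuration is read from Java system properties, so the property would have to be set before the SparkContext is created; the value 0.2 is just an illustration, not a recommendation:

```scala
// Sketch: shrink the storage fraction for a job that never persists RDDs,
// hoping to leave more heap for shuffle output before it spills to disk.
// In Spark 0.8, config is read from Java system properties, so this must
// run before the SparkContext is constructed. 0.2 is an illustrative value.
object MemoryFractionDemo {
  def main(args: Array[String]): Unit = {
    System.setProperty("spark.storage.memoryFraction", "0.2")
    // A SparkContext created after this point would pick up the lower fraction.
    println(System.getProperty("spark.storage.memoryFraction"))
  }
}
```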

- Ufuk