spark-dev mailing list archives

From Michael Slavitch <>
Subject Re: Eliminating shuffle write and spill disk IO reads/writes in Spark
Date Fri, 01 Apr 2016 21:10:25 GMT
I totally disagree that it’s not a problem.

- Network fetch throughput on 40G Ethernet exceeds the throughput of NVME drives.
- Spark depends on Linux's I/O cache as an effective buffer pool. This is fine
for small jobs but not for jobs with datasets in the TB-per-node range.
- On larger jobs flushing the cache causes Linux to block.
- On a modern 56-hyperthread, 2-socket host, the latency caused by multiple executors
writing out to disk increases greatly.

I thought the whole point of Spark was in-memory computing? It is in fact in-memory for
some things, but it uses spark.local.dir as a buffer pool for others.

Hence, the performance of Spark is gated by the performance of spark.local.dir, even on
large-memory systems.
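As an aside (not from the thread): spark.local.dir accepts a comma-separated list of directories, so even without eliminating the writes, shuffle I/O can at least be striped across several fast devices. The paths below are illustrative:

```
# conf/spark-defaults.conf -- paths are illustrative, not from the thread.
# Spreading spark.local.dir across independent NVMe devices lets shuffle
# writes proceed in parallel instead of contending on a single disk.
spark.local.dir  /nvme0/spark,/nvme1/spark,/nvme2/spark
```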

"Currently it is not possible to not write shuffle files to disk."

What changes >would< make it possible?

The only one that seems possible is to clone the shuffle service and make it in-memory.
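Reynold's ramdisk suggestion below gets partway there without code changes. A minimal sketch, assuming a tmpfs mount on each worker (mount point and size are illustrative values, not from the thread):

```shell
# Run on each worker node (requires root); path and size are illustrative.
mkdir -p /mnt/spark-ramdisk
mount -t tmpfs -o size=200g tmpfs /mnt/spark-ramdisk

# Then in conf/spark-defaults.conf on each node:
#   spark.local.dir  /mnt/spark-ramdisk
```

Spark's shuffle code path is unchanged; the "disk" writes simply land in RAM. Note that tmpfs pages count against node memory, so the job fails once shuffle data exceeds the mount size, which is consistent with the fail-rather-than-spill goal stated below.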

> On Apr 1, 2016, at 4:57 PM, Reynold Xin <> wrote:
> spark.shuffle.spill actually has nothing to do with whether we write shuffle files to
> disk. Currently it is not possible to not write shuffle files to disk, and typically it is
> not a problem because the network fetch throughput is lower than what disks can sustain. In
> most cases, especially with SSDs, there is little difference between putting all of those
> in memory and on disk.
> However, it is becoming more common to run Spark on a small number of beefy nodes (e.g.
> 2 nodes, each with 1TB of RAM). We do want to look into improving performance for those. In
> the meantime, you can set up local ramdisks on each node for shuffle writes.
> On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <> wrote:
> Hello;
> I'm working with Spark on very large memory systems (2TB+) and notice that Spark spills
> to disk during shuffle. Is there a way to force Spark to stay in memory when doing shuffle
> operations? The goal is to keep the shuffle data either in the heap or in off-heap memory
> (in 1.6.x) and never touch the IO subsystem. I am willing to have the job fail if it runs
> out of RAM.
> Setting spark.shuffle.spill to false is deprecated in 1.6 and is ignored by the Tungsten
> sort shuffle manager:
> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false, but this is ignored
> by the tungsten-sort shuffle manager; its optimized shuffles will continue to spill to disk
> when necessary."
> If this is impossible via configuration changes what code changes would be needed to
accomplish this?
