spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kay Ousterhout <...@eecs.berkeley.edu>
Subject Re: Why is shuffle data always persisted to disk?
Date Wed, 29 Mar 2017 21:00:21 GMT
There's been some discussion of this on this JIRA
<https://issues.apache.org/jira/browse/SPARK-3376> and the associated PR.
The short summary is that, in theory / to a VERY rough approximation, the
OS buffer cache does everything we'd want an in-memory shuffle to do, and
is simple.

On Wed, Mar 29, 2017 at 3:19 AM, Effi Ofer <effi.ofer@gmail.com> wrote:

> Greetings, I was wondering why Spark's Shuffler always persists the
> shuffle data to disk?  I understand that the persisted data can be used by
> the scheduler to truncate the lineage of the RDD graph if an existing RDD
> has been materialized as a side effect of an earlier shuffle.  But that
> does not explain why Spark is not keeping the shuffle RDD in memory until
> memory becomes sufficiently low to trigger victim selection and spilling.
> Any hints and pointers would be appreciated.
>
> Thanks,
> Effi
>
>

Mime
View raw message