spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ufuk Celebi <>
Subject Task output before a shuffle
Date Mon, 28 Oct 2013 16:25:33 GMT
Hey everybody,

I just watched the Spark Internals presentation [1] from the December 2012 dev meetup and
have a couple of questions regarding the output of tasks before a shuffle.

1. Can anybody confirm that the default is still to persist stage output to RAM/disk and then
have the following tasks pull it (see [1] around 45:40)? I guess a couple of things have changed
since last year. I just want to be sure that this is not one of those things. ;-)

2. Is it possible to switch to a "push" model between stages instead of having the following
tasks "pull" the result? I guess this is equivalent to the question whether it is possible
to turn persisting results off.

3. Does the data need to be fully persisted before the next stage can start? Or will the following
task start pulling data before everything is written out?

4. Is the main motivation for persisting to have faster recovery times on failures (e.g. checkpointing)?

Best wishes,


View raw message