spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Giraph Vs SPARK
Date Thu, 23 Jan 2014 22:41:48 GMT
The data gets written to files for fault tolerance, in case we need to re-run a reduce task
and re-fetch the files after. Otherwise, we’d have to re-run *all* the map tasks whenever
one reduce task fails. However, these files usually remain in the OS buffer cache so they
are written and read at memory speed. In the future we might add a setting that skips this
and uses Spark’s memory store for shuffle data instead.
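
To make the trade-off above concrete, here is a toy model in Python. It is not Spark's shuffle code: `ToyShuffle`, `run_map_task`, and every method name are hypothetical, and an in-memory dict stands in for the shuffle files. It only counts how many map task executions are needed when one reduce task is retried, with and without saved map outputs.

```python
from collections import defaultdict


def run_map_task(records, num_reducers):
    """Partition one map task's (key, value) records by target reducer."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash(key) % num_reducers].append((key, value))
    return buckets


class ToyShuffle:
    """Toy model only; the names here are illustrative, not Spark APIs."""

    def __init__(self, map_inputs, num_reducers, persist_map_outputs):
        self.map_inputs = map_inputs
        self.num_reducers = num_reducers
        self.persist = persist_map_outputs
        self.saved = {}     # map task id -> buckets (stands in for shuffle files)
        self.map_runs = 0   # number of map task executions so far

    def fetch(self, map_id):
        """Return a map task's output, re-running the task only if no copy was saved."""
        if map_id in self.saved:
            return self.saved[map_id]
        self.map_runs += 1
        output = run_map_task(self.map_inputs[map_id], self.num_reducers)
        if self.persist:
            self.saved[map_id] = output
        return output

    def run_map_stage(self):
        """Run every map task once and return the outputs the reducers receive."""
        return [self.fetch(m) for m in range(len(self.map_inputs))]

    def reduce(self, reducer_id, map_outputs):
        """Sum the values per key for the records routed to this reducer."""
        totals = defaultdict(int)
        for output in map_outputs:
            for key, value in output.get(reducer_id, []):
                totals[key] += value
        return dict(totals)

    def retry_reduce(self, reducer_id):
        """Retry a failed reduce task: the data it had received is gone, so fetch again."""
        outputs = [self.fetch(m) for m in range(len(self.map_inputs))]
        return self.reduce(reducer_id, outputs)
```

With three map tasks, calling `run_map_stage()` and then `retry_reduce(0)` leaves `map_runs` at 3 when `persist_map_outputs=True`, and raises it to 6 when it is `False`, because every map task has to run again to rebuild the lost input.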

On the reduce side there’s no use of disk except in Spark 0.9, where we added the option
to spill to disk if the reduce’s inputs don’t fit in memory.
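
The general idea of spilling when reduce inputs don't fit in memory can be sketched as an external aggregation. This is an assumption-laden illustration, not how Spark 0.9 implements it: `reduce_with_spill` and `max_keys_in_memory` are hypothetical names, the "memory limit" is just a key count, and the reduce function is fixed to a sum.

```python
import heapq
import pickle
import tempfile
from itertools import groupby
from operator import itemgetter


def _spill(partial, spill_files):
    """Write the in-memory partial sums to a temp file as one key-sorted run."""
    f = tempfile.TemporaryFile()
    for item in sorted(partial.items()):
        pickle.dump(item, f)
    f.seek(0)
    spill_files.append(f)


def _read_run(f):
    """Stream (key, partial_sum) pairs back from one spilled run."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return


def reduce_with_spill(records, max_keys_in_memory):
    """Sum values per key, spilling partial sums to disk when too many keys are held.

    Keys must be sortable. Yields (key, total) pairs in key order.
    """
    partial, spill_files = {}, []
    for key, value in records:
        partial[key] = partial.get(key, 0) + value
        if len(partial) > max_keys_in_memory:
            _spill(partial, spill_files)
            partial = {}

    # Merge the sorted on-disk runs with what is still in memory, one key at a time.
    runs = [_read_run(f) for f in spill_files]
    runs.append(iter(sorted(partial.items())))
    merged = heapq.merge(*runs, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, sum(v for _, v in group)

    for f in spill_files:
        f.close()
```

When the limit is never exceeded, nothing is written to disk and the function behaves like a plain in-memory aggregation, which matches the "only if the inputs don't fit" behaviour described above.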

Matei

On Jan 23, 2014, at 2:25 PM, suman bharadwaj <suman.dna@gmail.com> wrote:

> Hi,
> 
> Sorry for the confusion. 
> 
> So let me rephrase my question.
> 
> Why does SPARK have to write the intermediate data to disk when there is a shuffle dependency?
> Can't the communication happen directly just like Giraph ?
> And does data get written at reducer side as well ?
> 
> Again please feel free to correct me, in case my understanding is incorrect.
> 
> Regards,
> SB
> 
> 
> On Fri, Jan 24, 2014 at 3:44 AM, Jey Kottalam <jey@cs.berkeley.edu> wrote:
> Hi Suman,
> 
> Spark does indeed do in-memory computation, and does not require
> spilling to disk after every map task. Could you explain where you
> "see that intermediate map outputs gets written to disk"? Perhaps
> you're seeing some intermediate results during a shuffle phase? In
> that case, you may want to look into the
> "spark.shuffle.consolidateFiles" option:
> https://spark.incubator.apache.org/docs/0.8.1/configuration.html
> 
> -Jey
> 
> On Thu, Jan 23, 2014 at 1:10 PM, suman bharadwaj <suman.dna@gmail.com> wrote:
> > Hi,
> >
> > I might be wrong, but need your help.
> >
> > My understanding in Giraph is that, it doesn't write the intermediate data
> > to disk while sending messages to different machines. But in SPARK, I see
> > that intermediate map outputs gets written to disk. Why does SPARK write
> > intermediate data to disk ?
> >
> > What happens at reducer side ? Does SPARK write the data again to disk ? How
> > does it differ from Hadoop MR ?
> >
> > Can't SPARK communicate everything in memory ?
> >
> > If my understanding is wrong, please do correct me.
> >
> > Regards,
> > Suman Bharadwaj S
> 

