spark-user mailing list archives

From Renato Marroquín Mogrovejo <renatoj.marroq...@gmail.com>
Subject Re: Spark caching
Date Mon, 30 Mar 2015 10:49:55 GMT
Thanks Sean!
Do you know if there is a way (even a manual one) to delete these
intermediate shuffle results? I just want to test the "expected" behaviour.
I know that reusing shuffle output is probably beneficial most of the time,
but I want to try running without it.
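For comparison, the supported way to control reuse is explicit caching with
`cache()`/`unpersist()` rather than relying on implicit shuffle-file reuse.
A minimal local sketch (class name and sample data are illustrative, not
from the thread):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CachingSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("caching-sketch")
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Build the RDD once and cache it explicitly; later actions read
        // the cached partitions instead of recomputing the lineage.
        JavaRDD<Integer> data =
                sc.parallelize(Arrays.asList(1, 2, 3, 4)).cache();

        long first = data.count();   // materializes and caches the RDD
        long second = data.count();  // served from the cache

        // unpersist() drops the cached blocks, so the next action goes
        // back through the lineage (modulo any shuffle-file reuse).
        data.unpersist();
        long third = data.count();

        System.out.println(first + " " + second + " " + third);
        sc.stop();
    }
}
```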


Renato M.

2015-03-30 12:15 GMT+02:00 Sean Owen <sowen@cloudera.com>:

> I think that you get a sort of "silent" caching after shuffles, in
> some cases, since the shuffle files are not immediately removed and
> can be reused.
>
> (This is the flip side to the frequent question/complaint that the
> shuffle files aren't removed straight away.)
>
> On Mon, Mar 30, 2015 at 9:43 AM, Renato Marroquín Mogrovejo
> <renatoj.marroquin@gmail.com> wrote:
> > Hi all,
> >
> > I am trying to understand how Spark's lazy evaluation works, and I need
> > some help.
> > I have noticed that creating an RDD once and using it many times won't
> > trigger recomputation every time it gets used, whereas creating a new
> > RDD each time an operation is performed triggers recomputation of the
> > whole RDD again.
> > I would have thought that both approaches behave similarly (i.e. no
> > caching) due to Spark's lazy evaluation strategy, but I guess Spark
> > keeps track of the RDDs used and of the partial results computed so
> > far, so it doesn't do unnecessary extra work. Could anybody point me to
> > where Spark decides what to cache, or how I can disable this behaviour?
> > Thanks in advance!
> >
> >
> > Renato M.
> >
> > Approach 1 --> this doesn't trigger recomputation of the RDD in every
> > iteration
> > =========
> > JavaRDD aggrRel =
> > Utils.readJavaRDD(...).groupBy(groupFunction).map(mapFunction);
> > for (int i = 0; i < NUM_RUNS; i++) {
> >    // doing some computation like aggrRel.count()
> >    . . .
> > }
> >
> > Approach 2 --> this triggers recomputation of the RDD in every iteration
> > =========
> > for (int i = 0; i < NUM_RUNS; i++) {
> >    JavaRDD aggrRel =
> > Utils.readJavaRDD(...).groupBy(groupFunction).map(mapFunction);
> >    // doing some computation like aggrRel.count()
> >    . . .
> > }
>
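There is no public API for deleting shuffle files directly, but Spark's
ContextCleaner removes the shuffle output of an RDD once no reference to it
remains on the driver and it gets garbage-collected (assuming the default
`spark.cleaner.referenceTracking=true`). A hedged sketch of using that to
force recomputation each iteration (sample data and class name are made up):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ShuffleCleanupSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("shuffle-cleanup-sketch")
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        for (int i = 0; i < 3; i++) {
            // Rebuilding the RDD each iteration plans a fresh shuffle,
            // but files from a previous identical shuffle may be reused.
            JavaPairRDD<Integer, Iterable<Integer>> grouped =
                    sc.parallelize(Arrays.asList(1, 2, 3, 4, 1, 2))
                      .groupBy(x -> x % 2);
            System.out.println(grouped.count());

            // Dropping the only driver-side reference and hinting a GC
            // lets the ContextCleaner remove the dead RDD's shuffle
            // files, so the next iteration recomputes from scratch.
            grouped = null;
            System.gc();
        }
        sc.stop();
    }
}
```

Whether the cleaner actually runs before the next iteration depends on GC
timing, so this is best-effort rather than a guaranteed delete.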
