Thanks Sean!
Do you know if there is a way (even manually) to delete these intermediate shuffle results? I just want to test the "expected" behaviour. I know that this caching might be a positive thing most of the time, but I want to try it without it.
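Something along these lines is what I want to try -- a rough sketch with made-up data, assuming the context cleaner is enabled (spark.cleaner.referenceTracking, the default) and that it removes the shuffle files once the shuffled RDD becomes unreachable:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ForceShuffleCleanup {
    public static void main(String[] args) throws Exception {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("force-shuffle-cleanup").setMaster("local[2]"));

        for (int i = 0; i < 3; i++) {
            // Rebuild the lineage every iteration, as in "Approach 2" further down.
            JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 3);
            JavaPairRDD<Integer, Iterable<Integer>> grouped = nums.groupBy(n -> n % 2);
            grouped.count();      // runs the shuffle and leaves shuffle files on disk

            grouped = null;       // drop the only references to the shuffled RDD
            nums = null;
            System.gc();          // hint a GC so the ContextCleaner notices the dead RDD
            Thread.sleep(1000);   // the cleaner deletes the shuffle files asynchronously
        }

        sc.stop();
    }
}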


Renato M.

2015-03-30 12:15 GMT+02:00 Sean Owen <sowen@cloudera.com>:
I think that you get a sort of "silent" caching after shuffles, in
some cases, since the shuffle files are not immediately removed and
can be reused.

(This is the flip side to the frequent question/complaint that the
shuffle files aren't removed straight away.)
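A quick way to see it, as a minimal sketch with made-up data (nothing from your job): run two actions on the same shuffled RDD with no cache()/persist() anywhere; in the UI, the map-side stage of the second job shows up as skipped because the shuffle files written by the first job are reused.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ShuffleReuseDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("shuffle-reuse").setMaster("local[2]"));

        JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);

        // groupBy forces a shuffle; note there is no cache()/persist() anywhere.
        JavaPairRDD<Integer, Iterable<Integer>> grouped = nums.groupBy(n -> n % 2);

        grouped.count();   // job 1: runs the map stage and writes shuffle files
        grouped.count();   // job 2: the map stage is skipped, the shuffle files are reused

        sc.stop();
    }
}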

On Mon, Mar 30, 2015 at 9:43 AM, Renato Marroquín Mogrovejo
<renatoj.marroquin@gmail.com> wrote:
> Hi all,
>
> I am trying to understand how Spark's lazy evaluation works, and I need
> some help.
> I have noticed that creating an RDD once and using it many times won't
> trigger recomputation of it every time it gets used, whereas creating a
> new RDD every time an operation is performed will trigger recomputation
> of the whole RDD again.
> I would have thought that both approaches behave similarly (i.e. no
> caching) due to Spark's lazy evaluation strategy, but I guess Spark
> keeps track of the RDDs used and of the partial results computed so far,
> so it doesn't do unnecessary extra work. Could anybody point me to where
> Spark decides what to cache, or how I can disable this behaviour?
> Thanks in advance!
>
>
> Renato M.
>
> Approach 1 --> this doesn't trigger recomputation of the RDD in every
> iteration
> =========
> JavaRDD aggrRel = Utils.readJavaRDD(...)
>         .groupBy(groupFunction)
>         .map(mapFunction);
> for (int i = 0; i < NUM_RUNS; i++) {
>     // doing some computation like aggrRel.count()
>     . . .
> }
>
> Approach 2 --> this triggers recomputation of the RDD in every iteration
> =========
> for (int i = 0; i < NUM_RUNS; i++) {
>     JavaRDD aggrRel = Utils.readJavaRDD(...)
>             .groupBy(groupFunction)
>             .map(mapFunction);
>     // doing some computation like aggrRel.count()
>     . . .
> }
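
As far as I know there is no switch for the shuffle-file reuse itself, but the part you can control explicitly is cache()/persist() and unpersist(). A minimal sketch with made-up data (not your Utils.readJavaRDD pipeline):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ExplicitCacheDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("explicit-cache").setMaster("local[2]"));

        JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 3);
        JavaPairRDD<Integer, Iterable<Integer>> aggrRel = nums.groupBy(n -> n % 2);

        aggrRel.cache();             // opt in: partitions are kept after the first action

        for (int i = 0; i < 5; i++) {
            aggrRel.count();         // first run computes and caches; later runs read the cache
        }

        aggrRel.unpersist();         // opt out: drop the cached partitions again
        sc.stop();
    }
}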