spark-user mailing list archives

From Tobias Pfeiffer <...@preferred.jp>
Subject Re: If an RDD appeared twice in a DAG, of which calculation is triggered by a single action, will this RDD be calculated twice?
Date Tue, 20 Jan 2015 00:53:36 GMT
Hi,

On Sat, Jan 17, 2015 at 3:37 AM, Peng Cheng <pc175@uow.edu.au> wrote:

> I'm talking about RDD1 (not persisted or checkpointed) in this situation:
>
> ...(somewhere) -> RDD1 -> RDD2
>                     |       |
>                     V       V
>                    RDD3 -> RDD4 -> Action!
>
> In my experience the chance that RDD1 gets recalculated is volatile:
> sometimes once, sometimes twice.


That should not happen if your access pattern to RDD2 and RDD3 is always
the same.

> A related problem might be in SQLContext.jsonRDD(), since the source
> jsonRDD is used twice (once for schema inference, once for reading the
> data). It almost guarantees that the source jsonRDD is calculated twice.
> Has this problem been addressed so far?
>

That's exactly why schema inference is expensive. However, I am afraid that
in general you have to decide between "store" and "recompute" (cf.
http://en.wikipedia.org/wiki/Space%E2%80%93time_tradeoff). There is no way
to avoid recomputation on each access other than storing the value, I
guess.
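
To make the trade-off concrete, here is a toy model in plain Python. It is
not Spark code: LazyNode, persist() and evaluate() are names made up for this
sketch, and it ignores everything a real scheduler does (stages, partitions,
reuse of shuffle output). It only mimics the diamond from the question, where
RDD1 feeds both RDD2 and RDD3, and counts how often RDD1 is computed for a
single action on RDD4.

```python
class LazyNode:
    """Toy stand-in for an RDD: a value computed on demand from its parents."""

    def __init__(self, name, compute, parents=()):
        self.name = name
        self.compute = compute        # function of the parents' values
        self.parents = list(parents)
        self.evaluations = 0          # how many times compute() actually ran
        self.persisted = False
        self._stored = None
        self._has_stored = False

    def persist(self):
        """Mark this node so its value is kept after the first evaluation."""
        self.persisted = True
        return self

    def evaluate(self):
        # "Store": reuse the kept value if there is one.
        if self.persisted and self._has_stored:
            return self._stored
        # "Recompute": walk the lineage again, parents first.
        inputs = [parent.evaluate() for parent in self.parents]
        self.evaluations += 1
        value = self.compute(*inputs)
        if self.persisted:
            self._stored, self._has_stored = value, True
        return value


def run(persist_rdd1):
    """Build the diamond DAG and trigger one action on RDD4."""
    rdd1 = LazyNode("RDD1", lambda: list(range(5)))
    if persist_rdd1:
        rdd1.persist()
    rdd2 = LazyNode("RDD2", lambda xs: [x * 2 for x in xs], [rdd1])
    rdd3 = LazyNode("RDD3", lambda xs: [x + 1 for x in xs], [rdd1])
    rdd4 = LazyNode("RDD4",
                    lambda a, b: [x + y for x, y in zip(a, b)],
                    [rdd2, rdd3])
    result = rdd4.evaluate()          # the single "action"
    return result, rdd1.evaluations


if __name__ == "__main__":
    print(run(persist_rdd1=False))    # RDD1 is reached via RDD2 and via RDD3
    print(run(persist_rdd1=True))     # RDD1 is computed once, then reused
```

In this model the unpersisted RDD1 is computed once per path that reaches it,
and marking it as persisted trades memory for that second computation. In
Spark itself the corresponding step would be calling persist() (or cache())
on RDD1 before the action.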

Tobias
