spark-user mailing list archives

From Tobias Pfeiffer
Subject Re: If an RDD appeared twice in a DAG, of which calculation is triggered by a single action, will this RDD be calculated twice?
Date Tue, 20 Jan 2015 00:53:36 GMT

On Sat, Jan 17, 2015 at 3:37 AM, Peng Cheng wrote:

> I'm talking about RDD1 (not persisted or checkpointed) in this situation:
>
> ...(somewhere) -> RDD1 -> RDD2
>                    |       |
>                    V       V
>                   RDD3 -> RDD4 -> Action!
>
> In my experience, the chance that RDD1 gets recalculated is volatile:
> sometimes once, sometimes twice.

That should not happen if your access pattern to RDD2 and RDD3 is always
the same.

> A related problem might be in $SQLContext.jsonRDD(), since the source
> jsonRDD is used twice (once for schema inference, once for reading data). It
> almost guarantees that the source jsonRDD is calculated twice. Has this
> problem been addressed so far?

That's exactly why schema inference is expensive. However, I am afraid that in
general you have to decide between "store" and "recompute". There is no way
to avoid recomputation on each access other than storing the value.
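
To make the "store" versus "recompute" choice concrete, here is a toy model in
plain Python. It is not Spark code: the LazyDataset class and the run_diamond
function are hypothetical names made up for this sketch, and it only assumes
that each dataset holds a recipe for its data and re-runs that recipe on every
access unless told to keep the result. It rebuilds the diamond from the quoted
question (RDD1 feeding both RDD2 and RDD3, which are joined into RDD4) and
counts how often the source is computed.

```python
class LazyDataset:
    """Toy stand-in for an RDD: holds a recipe for the data, not the data."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function returning a list
        self._keep = False        # set by persist()
        self._stored = None       # filled on first access if persisted

    def persist(self):
        # Ask for the result to be kept after the first computation.
        self._keep = True
        return self

    def collect(self):
        # Reuse the stored value if there is one, otherwise recompute.
        if self._stored is not None:
            return self._stored
        data = self._compute()
        if self._keep:
            self._stored = data
        return data

    def map(self, fn):
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def zip(self, other):
        return LazyDataset(lambda: list(zip(self.collect(), other.collect())))


def run_diamond(persist_rdd1):
    """Build RDD1 -> (RDD2, RDD3) -> RDD4 and count computations of RDD1."""
    calls = {"rdd1": 0}

    def load_source():
        calls["rdd1"] += 1
        return [1, 2, 3]

    rdd1 = LazyDataset(load_source)
    if persist_rdd1:
        rdd1.persist()
    rdd2 = rdd1.map(lambda x: x * 10)
    rdd3 = rdd1.map(lambda x: x + 1)
    rdd4 = rdd2.zip(rdd3)
    result = rdd4.collect()       # the single "action"
    return result, calls["rdd1"]


if __name__ == "__main__":
    print(run_diamond(persist_rdd1=False))
    print(run_diamond(persist_rdd1=True))
```

In this sketch, the source is computed once per path that reaches it when
nothing is stored, and only once when it is persisted. The price of the second
variant is the memory held by the stored copy, which is the trade-off described
above. How a real Spark job behaves also depends on partitioning, shuffles, and
the storage level chosen, none of which this toy model covers.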

