spark-user mailing list archives

From Peng Cheng
Subject If an RDD appeared twice in a DAG, of which calculation is triggered by a single action, will this RDD be calculated twice?
Date Fri, 16 Jan 2015 18:40:07 GMT
I'm talking about RDD1 (not persisted or checkpointed) in this situation:

...(somewhere) -> RDD1 -> RDD2
                   |       |
                   V       V
                  RDD3 -> RDD4 -> Action!

In my experience, whether RDD1 gets recalculated is unpredictable: sometimes
it is computed once, sometimes twice. When calculating this RDD is expensive
(e.g. it involves calling a RESTful service that charges me money), this
compels me to persist RDD1, which takes extra memory. And since the Action!
doesn't always happen, I don't know when to unpersist it to free that memory.
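
To make the concern concrete, here is a toy sketch in plain Python. It is not
Spark code: LazyNode, build_dag and run_action are hypothetical names made up
for illustration, and the model ignores Spark's scheduler, stages and
partitions. It only shows why a lazily evaluated node that feeds two branches
is computed once per branch unless it is cached, and where a persist/unpersist
pair would sit around the single action.

```python
from collections import Counter


class LazyNode:
    """Toy stand-in for a lazily evaluated dataset (NOT a Spark RDD)."""

    def __init__(self, name, fn, parents=(), counts=None):
        self.name = name
        self.fn = fn                      # combines the parents' results
        self.parents = list(parents)
        self.counts = counts if counts is not None else Counter()
        self._persist = False
        self._has_cache = False
        self._cached = None

    def persist(self):
        # Keep the result of the next compute() for later reuse.
        self._persist = True
        return self

    def unpersist(self):
        # Drop any cached result and stop caching.
        self._persist = False
        self._has_cache = False
        self._cached = None
        return self

    def compute(self):
        if self._has_cache:
            return self._cached
        inputs = [p.compute() for p in self.parents]
        self.counts[self.name] += 1       # count every real evaluation
        result = self.fn(*inputs)
        if self._persist:
            self._cached, self._has_cache = result, True
        return result


def build_dag(persist_rdd1):
    # Same shape as the diagram: RDD1 feeds RDD2 and RDD3, both feed RDD4.
    counts = Counter()
    rdd1 = LazyNode("RDD1", lambda: [1, 2, 3], counts=counts)
    if persist_rdd1:
        rdd1.persist()
    rdd2 = LazyNode("RDD2", lambda xs: [x * 10 for x in xs], [rdd1], counts)
    rdd3 = LazyNode("RDD3", lambda xs: [x + 1 for x in xs], [rdd1], counts)
    rdd4 = LazyNode("RDD4", lambda a, b: list(zip(a, b)), [rdd2, rdd3], counts)
    return rdd1, rdd4, counts


def run_action(persist_rdd1):
    rdd1, rdd4, counts = build_dag(persist_rdd1)
    result = rdd4.compute()               # the single "Action!"
    rdd1.unpersist()                      # release the cache once the action is done
    return result, counts["RDD1"]
```

In this toy model, run_action(False) evaluates RDD1 once for each downstream
branch, while run_action(True) evaluates it only once. The unpersist call is
easy to place here because the action always runs; my difficulty is the case
where it may never run.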

A related problem might be in SQLContext.jsonRDD(), since the source
jsonRDD is used twice (once for schema inference, once for reading the data).
This almost guarantees that the source jsonRDD is calculated twice. Is there a
way to solve (or circumvent) this problem?
