spark-user mailing list archives

From Peng Cheng <>
Subject If an RDD appeared twice in a DAG, of which calculation is triggered by a single action, will this RDD be calculated twice?
Date Fri, 16 Jan 2015 18:37:43 GMT
I'm talking about RDD1 (not persisted or checkpointed) in this situation:

...(somewhere) -> RDD1 -> RDD2
                   |       |
                   V       V
                  RDD3 -> RDD4 -> Action!

In my experience, whether RDD1 gets recalculated is unpredictable: sometimes
it is computed once, sometimes twice. When computing this RDD is expensive
(e.g. it calls a RESTful service that charges me money), this compels me to
persist RDD1, which takes extra memory, and in case the Action! doesn't
always happen, I don't know when to unpersist it to free that memory.
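
To make the situation concrete, here is a minimal toy model in plain Python.
It is not Spark and not Spark's scheduler: LazyNode, build_dag and run_action
are hypothetical names invented for this sketch. It only assumes pull-based
evaluation, where each node recomputes from its parents whenever it is asked
for data unless it has been marked as persisted. The node names follow the
diagram above.

```python
class LazyNode:
    """Toy stand-in for an RDD (hypothetical; not Spark's API)."""

    def __init__(self, name, fn, parents=()):
        self.name = name
        self.fn = fn
        self.parents = tuple(parents)
        self.persisted = False
        self._has_cache = False
        self._cache = None
        self.compute_count = 0  # how many times fn actually ran

    def persist(self):
        # Only marks the node; nothing is stored until the first evaluation.
        self.persisted = True
        return self

    def unpersist(self):
        self.persisted = False
        self._has_cache = False
        self._cache = None
        return self

    def evaluate(self):
        # Serve from the cache only if the node was persisted and already computed.
        if self.persisted and self._has_cache:
            return self._cache
        inputs = [parent.evaluate() for parent in self.parents]
        self.compute_count += 1
        result = self.fn(*inputs)
        if self.persisted:
            self._cache = result
            self._has_cache = True
        return result


def build_dag():
    """Build the diamond from the diagram: RDD1 feeds both RDD2 and RDD3."""
    rdd1 = LazyNode("RDD1", lambda: [1, 2, 3])  # imagine a paid REST call here
    rdd2 = LazyNode("RDD2", lambda xs: [x * 10 for x in xs], [rdd1])
    rdd3 = LazyNode("RDD3", lambda xs: [x + 1 for x in xs], [rdd1])
    rdd4 = LazyNode("RDD4", lambda a, b: list(zip(a, b)), [rdd3, rdd2])
    return rdd1, rdd4


def run_action(persist_rdd1):
    """Run one 'action' on RDD4 and report how often RDD1 was computed."""
    rdd1, rdd4 = build_dag()
    if persist_rdd1:
        rdd1.persist()
    try:
        result = rdd4.evaluate()
    finally:
        # Release the cached data as soon as the action that needed it is done.
        if persist_rdd1:
            rdd1.unpersist()
    return result, rdd1.compute_count


if __name__ == "__main__":
    print(run_action(persist_rdd1=False))
    print(run_action(persist_rdd1=True))
```

In this model RDD1 is computed twice without persist and once with it. The
try/finally shape is one possible answer to the "when to unpersist" worry:
the cache is scoped to the action that consumes it. Since persist() in Spark
is also lazy (it marks the RDD and stores blocks only when they are first
computed), an action that never runs should not leave cached data behind.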

A related problem might be in SQLContext.jsonRDD(), since the source
JSON RDD is used twice (once for schema inference, once for reading the
data). This almost guarantees that the source RDD is calculated twice. Has
this problem been addressed so far?
