Also check the `RDD.checkpoint()` method.

On Wed, Aug 2, 2017 at 8:46 PM, Vadim Semenov wrote:
I'm not sure that "checkpointed" means the same thing in that sentence.

You can run a simple test using `spark-shell`:

val rdd = sc.parallelize(1 to 10).map(x => {
rdd.foreach(println) // Will take 10 seconds
rdd.foreach(println) // Will be instant, because the RDD is checkpointed
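
A self-contained version of this kind of test might look like the sketch below. The one-second sleep per element, the `/tmp/spark-checkpoint-demo` directory, the `local[1]` master, and the `timeMs` helper are all assumptions made for illustration and are not taken from the message above. It is written as a standalone program; in `spark-shell` you would use the provided `sc` instead of creating a new context.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointTimingSketch {
  // Runs `body` and returns the elapsed wall-clock time in milliseconds.
  def timeMs(body: => Unit): Long = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1000000L
  }

  def main(args: Array[String]): Unit = {
    // local[1] keeps the work sequential so the per-element delay adds up.
    val conf = new SparkConf().setAppName("checkpoint-timing").setMaster("local[1]")
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("/tmp/spark-checkpoint-demo") // hypothetical directory

    val rdd = sc.parallelize(1 to 10).map { x =>
      Thread.sleep(1000) // assumed delay standing in for expensive work
      x
    }
    rdd.checkpoint() // must be called before the first action on this RDD

    val first = timeMs(rdd.foreach(println))  // computes the RDD and triggers the checkpoint write
    val second = timeMs(rdd.foreach(println)) // served from the checkpointed data
    println(s"first action: $first ms, second action: $second ms")

    sc.stop()
  }
}
```

Comparing the two printed durations shows whether the second action still pays for the slow `map`.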

On Wed, Aug 2, 2017 at 7:05 PM, jeff saremi wrote:

This is from the Mastering Spark book:

"It is strongly recommended that a checkpointed RDD is persisted in memory, otherwise saving it on a file will require recomputation."

To me that means a checkpoint will not prevent the recomputation that I was hoping to avoid.
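
One way to check what the quoted sentence implies is to count how often the transformation actually runs. The sketch below is only an illustration under assumptions: the accumulator name, the checkpoint directory, and the `evaluations` helper are hypothetical, and accumulator updates inside transformations are not guaranteed to be exact if tasks are retried. It compares a checkpointed RDD with and without a prior `persist`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CheckpointRecomputeSketch {
  // Counts how many times the map function runs during one action on a checkpointed RDD.
  def evaluations(sc: SparkContext, persistFirst: Boolean): Long = {
    val calls = sc.longAccumulator("map calls")
    val rdd = sc.parallelize(1 to 10).map { x => calls.add(1L); x }
    if (persistFirst) rdd.persist(StorageLevel.MEMORY_ONLY)
    rdd.checkpoint()
    rdd.count() // runs the job; the checkpoint is written as part of this first action
    calls.value
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-recompute").setMaster("local[1]")
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("/tmp/spark-checkpoint-demo") // hypothetical directory

    println(s"map calls without persist: ${evaluations(sc, persistFirst = false)}")
    println(s"map calls with persist:    ${evaluations(sc, persistFirst = true)}")

    sc.stop()
  }
}
```

If the book's statement holds, the count without `persist` should come out noticeably higher than the count with it, because writing the checkpoint recomputes the RDD once. Later actions would still read from the checkpoint rather than recompute.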

From: Vadim Semenov
Sent: Tuesday, August 1, 2017 12:05:17 PM
To: jeff saremi
Subject: Re: How can i remove the need for calling cache

You can use `.checkpoint()`:
val sc: SparkContext
val result1 =
result1.count() // Will save `myrdd` to HDFS and do map(op1…
val result2 =
result2.count() // Will load `myrdd` from HDFS and do map(op2…
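
Spelled out as a complete program, the pattern could look like the following sketch. The bodies of `op1` and `op2`, the input data, and the checkpoint path are hypothetical placeholders, not the code from the message above; on a cluster the checkpoint directory would normally be an HDFS path.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointReuseSketch {
  // Hypothetical stand-ins for the two operations called op1 and op2 in the thread.
  def op1(x: Int): Int = x * 2
  def op2(x: Int): Int = x + 100

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-reuse").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Hypothetical path; on a cluster this would normally be an HDFS directory.
    sc.setCheckpointDir("/tmp/spark-checkpoint-demo")

    // Stand-in for the RDD produced by a long chain of transformations.
    val myrdd = sc.parallelize(1 to 1000).map(_ + 1)
    myrdd.checkpoint() // only marks the RDD; nothing is written until an action runs

    val result1 = myrdd.map(op1)
    println(result1.count()) // first action: myrdd is computed and checkpointed

    val result2 = myrdd.map(op2)
    println(result2.count()) // later action: myrdd is read from the checkpoint files

    sc.stop()
  }
}
```

Note that `checkpoint()` has to be called before the first action on `myrdd`, and `setCheckpointDir` has to be called before `checkpoint()`.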

On Tue, Aug 1, 2017 at 2:05 PM, jeff saremi wrote:

Calling cache/persist fails all our jobs (I have posted 2 threads on this).

And we're giving up hope of finding a solution.
So I'd like to find a workaround for that:

If I save an RDD to HDFS and read it back, can I use it in more than one operation?

Example: (using cache)
// do a whole bunch of transformations on an RDD


val result1 =

val result2 =

// In the above I am assuming that a call to cache will prevent all previous transformations from being calculated twice

I'd like to somehow get result1 and result2 without duplicating work. How can I do that?
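
For reference, the manual "save and read back" idea from the question could be sketched roughly as follows. The output path, the sample data, and the two follow-up operations are hypothetical; `saveAsObjectFile` and `objectFile` are just one possible pair of write/read calls, chosen here for brevity.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SaveAndReloadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("save-and-reload").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Hypothetical output location; it must not exist yet, or the save fails.
    val path = "/tmp/expensive-rdd-" + System.currentTimeMillis()

    // Stand-in for the expensive chain of transformations.
    val expensive = sc.parallelize(1 to 1000).map(x => x.toLong * x)
    expensive.saveAsObjectFile(path) // runs the transformations once and writes the result

    // The reloaded RDD's lineage starts at the files, not at the original transformations.
    val reloaded = sc.objectFile[Long](path)
    val result1 = reloaded.filter(_ % 2 == 0).count()
    val result2 = reloaded.map(_ + 1).reduce(_ + _)
    println(s"result1 = $result1, result2 = $result2")

    sc.stop()
  }
}
```

Both results are derived from the reloaded RDD, so each action rereads the saved files instead of rerunning the original transformations.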