I have an RDD that is expensive to compute. I want to save it as an object file and also print its count. How can I avoid computing the RDD twice?

 

val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))

 

val count = rdd.count()  // this forces computation of the RDD

println(count)

rdd.saveAsObjectFile(file2) // this computes the RDD again

 

I can avoid the double computation by using cache:

 

val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))

rdd.cache()

val count = rdd.count() 

println(count)

rdd.saveAsObjectFile(file2) // this reuses the cached RDD instead of computing it again

 

This computes the RDD only once. However, the RDD has millions of items, and caching it will cause an out-of-memory error.

 

Question: How can I avoid the double computation without using cache?
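
For illustration, one possible single-pass direction is to count the records with an accumulator while the save job runs, so that saveAsObjectFile is the only action. This is a rough sketch, not a definitive answer. It assumes Spark 2.x or later (for sc.longAccumulator) and launching through spark-submit; the object name CountWhileSaving, the stand-in body of expensiveCalculation, and the command-line paths are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CountWhileSaving {
  // Hypothetical stand-in for the expensive per-line function in the question.
  def expensiveCalculation(line: String): String = line.toUpperCase

  def main(args: Array[String]): Unit = {
    // Master URL is assumed to be supplied by spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("count-while-saving"))
    val someFile = args(0) // input path (assumed to be passed on the command line)
    val file2 = args(1)    // output path (assumed to be passed on the command line)

    // Accumulator incremented once per record while the save job runs.
    val counter = sc.longAccumulator("records")

    sc.textFile(someFile)
      .map { line =>
        counter.add(1L)
        expensiveCalculation(line)
      }
      .saveAsObjectFile(file2) // the only action, so the lineage is evaluated once

    // Read the accumulator on the driver only after the action has finished.
    println(counter.value)
    sc.stop()
  }
}
```

One caveat: Spark only guarantees exactly-once accumulator updates inside actions. Because this update happens inside a map, retried or speculative tasks can inflate the count. If local disk space is available, rdd.persist(StorageLevel.DISK_ONLY) may be another option, since it keeps the computed partitions on disk rather than in memory.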

 

 

Ningjun