spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Wang, Ningjun (LNG-NPV)" <ningjun.w...@lexisnexis.com>
Subject RE: How to get rdd count() without double evaluation of the RDD?
Date Mon, 30 Mar 2015 17:08:42 GMT
Sean

Yes I know that I can use persist() to persist to disk, but it is still a big extra cost of
persist a huge RDD to disk. I hope that I can do one pass to get the count as well as rdd.saveAsObjectFile(file2),
but I don’t know how.

May be use accumulator to count the total ?

Ningjun

From: Mark Hamstra [mailto:mark@clearstorydata.com]
Sent: Thursday, March 26, 2015 12:37 PM
To: Sean Owen
Cc: Wang, Ningjun (LNG-NPV); user@spark.apache.org
Subject: Re: How to get rdd count() without double evaluation of the RDD?

You can also always take the more extreme approach of using SparkContext#runJob (or submitJob)
to write a custom Action that does what you want in one pass.  Usually that's not worth the
extra effort.

On Thu, Mar 26, 2015 at 9:27 AM, Sean Owen <sowen@cloudera.com<mailto:sowen@cloudera.com>>
wrote:

To avoid computing twice you need to persist the RDD but that need not be in memory. You can
persist to disk with persist().
On Mar 26, 2015 4:11 PM, "Wang, Ningjun (LNG-NPV)" <ningjun.wang@lexisnexis.com<mailto:ningjun.wang@lexisnexis.com>>
wrote:
I have a rdd that is expensive to compute. I want to save it as object file and also print
the count. How can I avoid double computation of the RDD?

val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))

val count = rdd.count()  // this force computation of the rdd
println(count)
rdd.saveAsObjectFile(file2) // this compute the RDD again

I can avoid double computation by using cache

val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
rdd.cache()
val count = rdd.count()
println(count)
rdd.saveAsObjectFile(file2) // this compute the RDD again

This only compute rdd once. However the rdd has millions of items and will cause out of memory.

Question: how can I avoid double computation without using cache?


Ningjun

Mime
View raw message