spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: How to get rdd count() without double evaluation of the RDD?
Date Thu, 26 Mar 2015 16:27:07 GMT
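To avoid computing it twice, you need to persist the RDD, but it need not be
in memory. You can persist to disk with persist(StorageLevel.DISK_ONLY);
persist() with no argument is the same as cache() and stores in memory only.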
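
A minimal sketch of the disk-persistence approach, applied to the code from the question. The object name CountAndSave, the body of expensiveCalculation, and the use of command-line arguments for the two paths are assumptions made for illustration; only the textFile / map / count / saveAsObjectFile pipeline comes from the original message.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CountAndSave {
  // Hypothetical stand-in for the expensive per-line function in the question.
  def expensiveCalculation(line: String): String = line.toUpperCase

  def main(args: Array[String]): Unit = {
    val someFile = args(0) // input path (assumed to be passed as an argument)
    val file2 = args(1)    // output path (assumed to be passed as an argument)

    // The master URL is expected to be supplied by spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("count-and-save"))
    try {
      val rdd = sc.textFile(someFile)
        .map(line => expensiveCalculation(line))
        .persist(StorageLevel.DISK_ONLY) // keep computed partitions on local disk, not in memory

      val count = rdd.count()     // first action: computes the RDD and writes its partitions to disk
      println(count)

      rdd.saveAsObjectFile(file2) // second action: reads the persisted partitions back
      rdd.unpersist()             // release the persisted blocks once they are no longer needed
    } finally {
      sc.stop()
    }
  }
}
```

StorageLevel.MEMORY_AND_DISK is an alternative if some of the data does fit in memory: partitions that do not fit are spilled to disk. In either case, if a persisted block is lost (for example, an executor dies), Spark recomputes that partition from its lineage.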
On Mar 26, 2015 4:11 PM, "Wang, Ningjun (LNG-NPV)" <
ningjun.wang@lexisnexis.com> wrote:

> I have an RDD that is expensive to compute. I want to save it as an object
> file and also print the count. How can I avoid computing the RDD twice?
>
>
>
> val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
>
>
>
> val count = rdd.count()  // this forces computation of the RDD
>
> println(count)
>
> rdd.saveAsObjectFile(file2) // this computes the RDD again
>
>
>
> I can avoid the double computation by using cache:
>
>
>
> val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
>
> rdd.cache()
>
> val count = rdd.count()
>
> println(count)
>
> rdd.saveAsObjectFile(file2) // this reuses the cached RDD instead of recomputing it
>
>
>
> This computes the RDD only once. However, the RDD has millions of items, and
> caching it will cause an out-of-memory error.
>
>
>
> Question: How can I avoid the double computation without using cache?
>
>
>
>
>
> Ningjun
>
