spark-user mailing list archives

From Mark Hamstra <m...@clearstorydata.com>
Subject Re: How to get rdd count() without double evaluation of the RDD?
Date Thu, 26 Mar 2015 16:36:35 GMT
You can also always take the more extreme approach of using
SparkContext#runJob (or submitJob) to write a custom Action that does what
you want in one pass.  Usually that's not worth the extra effort.
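Mark's one-pass idea can also be approximated without writing a full custom action, by counting with an accumulator as a side effect of the single save pass. This is a hedged sketch against the Spark 1.x accumulator API of that era; `someFile`, `expensiveCalculation`, and `file2` are placeholder names carried over from the question below:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("one-pass-count"))

val countAcc = sc.accumulator(0L)          // Spark 1.x accumulator API
val rdd = sc.textFile("someFile").map { line =>
  countAcc += 1                            // count as a side effect of the one pass
  expensiveCalculation(line)
}
rdd.saveAsObjectFile("file2")              // the only action: one computation
println(countAcc.value)                    // count observed during that pass
```

One caveat with this pattern: accumulators updated inside transformations can over-count if a task is retried or a stage is re-executed, so the count is only reliable when the job completes without retries.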

On Thu, Mar 26, 2015 at 9:27 AM, Sean Owen <sowen@cloudera.com> wrote:

> To avoid computing twice you need to persist the RDD, but that need not be
> in memory. You can persist to disk with persist(StorageLevel.DISK_ONLY).
> On Mar 26, 2015 4:11 PM, "Wang, Ningjun (LNG-NPV)" <
> ningjun.wang@lexisnexis.com> wrote:
>
>>  I have an RDD that is expensive to compute. I want to save it as an
>> object file and also print the count. How can I avoid computing the
>> RDD twice?
>>
>>
>>
>> val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
>>
>>
>>
>> val count = rdd.count()  // this forces computation of the RDD
>>
>> println(count)
>>
>> rdd.saveAsObjectFile(file2) // this computes the RDD again
>>
>>
>>
>> I can avoid the double computation by using cache:
>>
>>
>>
>> val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
>>
>> rdd.cache()
>>
>> val count = rdd.count()
>>
>> println(count)
>>
>> rdd.saveAsObjectFile(file2) // served from the cache, no recomputation
>>
>>
>>
>> This computes the RDD only once. However, the RDD has millions of items,
>> and caching it in memory will cause an out-of-memory error.
>>
>>
>>
>> Question: how can I avoid the double computation without using cache?
>>
>>
>>
>>
>>
>> Ningjun
>>
>
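Sean's persist-to-disk suggestion, sketched concretely (assuming Spark's `StorageLevel` API; `someFile`, `expensiveCalculation`, and `file2` are again the placeholder names from the question):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("someFile").map(line => expensiveCalculation(line))
rdd.persist(StorageLevel.DISK_ONLY)  // spill to local disk, not executor memory

val count = rdd.count()              // first action: computes RDD, persists to disk
println(count)
rdd.saveAsObjectFile("file2")        // second action: reads from disk, no recompute
rdd.unpersist()                      // free the on-disk copy when done
```

Because the persisted partitions live on local disk rather than in executor memory, this avoids the out-of-memory problem the original poster hit with `cache()` (which is shorthand for `persist(StorageLevel.MEMORY_ONLY)`), at the cost of one extra disk write and read.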
