spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From canan chen <ccn...@gmail.com>
Subject Re: Intermedate stage will be cached automatically ?
Date Wed, 17 Jun 2015 13:10:17 GMT
Yes, actually on the storage ui, there's no data cached. But the behavior
confuse me. If I call the cache method as following the behavior is the
same as without calling cache method, why's that ?


val data = sc.parallelize(1 to 10, 2).map(e=>(e%2,2)).reduceByKey(_ + _, 2)
data.cache()
println(data.count())
println(data.count())



On Wed, Jun 17, 2015 at 8:45 PM, ayan guha <guha.ayan@gmail.com> wrote:

> Its not cached per se. For example, you will not see this in Storage tab
> in UI. However, spark has read the data and its in memory right now. So,
> the next count call should be very fast.
>
>
> Best
> Ayan
>
> On Wed, Jun 17, 2015 at 10:21 PM, Mark Tse <Mark.Tse@d2l.com> wrote:
>
>>  I think
>> https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
>> might shed some light on the behaviour you’re seeing.
>>
>>
>>
>> Mark
>>
>>
>>
>> *From:* canan chen [mailto:ccnfdu@gmail.com]
>> *Sent:* June-17-15 5:57 AM
>> *To:* spark users
>> *Subject:* Intermedate stage will be cached automatically ?
>>
>>
>>
>> Here's one simple spark example that I call RDD#count 2 times. The first
>> time it would invoke 2 stages, but the second one only need 1 stage. Seems
>> the first stage is cached. Is that true ? Any flag can I control whether
>> the cache the intermediate stage
>>
>>
>>     *val *data = sc.parallelize(1 to 10, 2).map(e=>(e%2,2)).reduceByKey(_ + _,
2)
>>     *println*(data.count())
>>     *println*(data.count())
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>

Mime
View raw message