spark-user mailing list archives

From: Mark Hamstra <m...@clearstorydata.com>
Subject: Re: RDD cache question
Date: Sun, 01 Dec 2013 02:24:23 GMT
Your question doesn't really make any sense without specifying where any
RDD actions take place (i.e., where Spark jobs are actually run). Without
any actions, all you've outlined so far are different ways to specify the
chain of transformations that should be evaluated when an action is
eventually called and a job runs. In a real sense, your code hasn't
actually done anything yet.
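
For example, nothing in the following touches the cluster until the last
line; the reduce() is the action that actually launches a job (a minimal
sketch, assuming sc is an existing SparkContext):

    val data = sc.parallelize(1 to 1000)   // defines an RDD: no job yet
    val doubled = data.map(_ * 2)          // transformation: still just lineage
    val sum = doubled.reduce(_ + _)        // action: a Spark job runs here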


On Sat, Nov 30, 2013 at 6:01 PM, Yadid Ayzenberg <yadid@media.mit.edu> wrote:

> Hi All,
>
> I'm trying to implement the following and would like to know in which
> places I should be calling RDD.cache():
>
> Suppose I have a group of RDDs : RDD1 to RDDn as input.
>
> 1. Create a single RDD_total = RDD1.union(RDD2)...union(RDDn)
>
> 2. for i = 0 to x:    RDD_total = RDD_total.map (some map function());
>
> 3. return RDD_total.
>
> I think that I should cache RDD_total in order to optimize the iterations.
> Should I just be calling RDD_total.cache() at the end of each iteration,
> or should I be performing something more elaborate:
>
>
> RDD_temp = RDD_total.map (some map function());
> RDD_total.unpersist();
> RDD_total = RDD_temp.cache();
>
> Thanks,
> Yadid
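
That said, if each iteration of your loop did end with an action, the
second pattern you outline is the right shape: cache the new RDD,
materialize it with an action, and only then unpersist the old one, so
the old cached copy is still available while the new one is being
computed. A rough, untested sketch, reusing the names from your
pseudocode (someMapFunction() standing in for "some map function()",
with x, RDD1, RDD2, and RDDn as in your outline):

    var RDD_total = RDD1.union(RDD2).union(RDDn).cache()
    for (i <- 0 until x) {
      val RDD_temp = RDD_total.map(someMapFunction()).cache()
      RDD_temp.count()       // action: computes RDD_temp and fills its cache
      RDD_total.unpersist()  // drop the old copy only after the new one exists
      RDD_total = RDD_temp
    }

Note that if the only action comes after the loop, each intermediate
RDD_total is consumed exactly once, and caching it inside the loop buys
you nothing.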
