spark-user mailing list archives

From: Yadid Ayzenberg <ya...@media.mit.edu>
Subject: Re: RDD cache question
Date: Sun, 01 Dec 2013 02:39:07 GMT
Step 4 would be count() or collect(). The map() in step 2 would perform
the calculations and write the results to a DB.
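
Concretely, I mean something like this end to end (a rough Scala sketch;
sc is an existing SparkContext, and rdd1..rddN, x, and someMapFunction
are placeholders rather than actual code from my job):

    // Step 1: combine the inputs into a single RDD.
    var rddTotal = sc.union(Seq(rdd1, rdd2, rddN))

    // Step 2: iterate the transformation. Cache the newest RDD and
    // unpersist the previous one (a no-op if it was never materialized).
    // someMapFunction must keep the record type so rddTotal can be
    // reassigned each iteration.
    for (i <- 0 until x) {
      val next = rddTotal.map(someMapFunction).cache()
      rddTotal.unpersist()
      rddTotal = next
    }

    // Step 4: the action that actually triggers the job.
    rddTotal.count()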

Is this the information that was missing?

Thanks,

Yadid

On 11/30/13 9:24 PM, Mark Hamstra wrote:
> Your question doesn't really make sense without specifying where any
> RDD actions take place (i.e., where Spark jobs are actually run).
> Without any actions, all you've outlined so far are different ways to
> specify the chain of transformations that should be evaluated when an
> action is eventually called and a job runs. In a real sense, your code
> hasn't actually done anything yet.
>
>
> On Sat, Nov 30, 2013 at 6:01 PM, Yadid Ayzenberg <yadid@media.mit.edu> wrote:
>
>     Hi All,
>
>     I'm trying to implement the following and would like to know
>     where I should be calling RDD.cache():
>
>     Suppose I have a group of RDDs, RDD1 to RDDn, as input.
>
>     1. Create a single RDD_total = RDD1.union(RDD2).union(...).union(RDDn)
>
>     2. For i = 0 to x: RDD_total = RDD_total.map(some map function());
>
>     3. return RDD_total.
>
>     I assume that I should cache RDD_total in order to optimize the
>     iterations. Should I just call RDD_total.cache() at the end of
>     each iteration, or should I perform something more elaborate:
>
>
>     RDD_temp = RDD_total.map(some map function());
>     RDD_total.unpersist();
>     RDD_total = RDD_temp.cache();
>
>     Thanks,
>     Yadid

