spark-user mailing list archives

From Ron Ayoub <ronalday...@live.com>
Subject RE: Java RDD Union
Date Sat, 06 Dec 2014 12:28:58 GMT
With that said, and given the iterative algorithms Spark is advertised for, isn't this a bit of
an unnecessary restriction? I don't see where the problem is. For instance, it is clear that
when aggregating, the operations need to be associative because of the way the work is
divided and combined. But since forEach works on an individual item, the same problem
doesn't exist.

As an example, during a k-means algorithm you have to continually update each data item's
cluster assignment, along with perhaps its distance from the centroid. So if you can't update
items in place, you have to literally create thousands upon thousands of RDDs. Does Spark
have some kind of trick behind the scenes, like reuse or fully persistent data structures?
How can it possibly be efficient for 'iterative' algorithms when it is creating so many RDDs
as opposed to one?

> From: sowen@cloudera.com
> Date: Fri, 5 Dec 2014 14:58:37 -0600
> Subject: Re: Java RDD Union
> To: ronaldayoub@live.com; user@spark.apache.org
> 
> foreach also does not modify an existing RDD; being an action rather
> than a transformation, it does not create a new RDD either. However,
> in practice, nothing stops you from fiddling with the Java objects
> inside an RDD when you get a reference to them in a method like this.
> This is definitely a bad idea, as there is no guarantee that any other
> operations will see any, some, or all of these edits.
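The hazard described above can be sketched without Spark (a plain java.util.List standing in for an RDD; the method names are illustrative only): mutating shared objects in a forEach-style loop versus deriving new values map-style.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ImmutableUpdate {
    // Unsafe pattern: forEach mutates elements in place. In local mode this can
    // appear to work because everything shares one JVM heap; on a cluster the
    // mutated copies live in executor memory, and later operations recompute
    // from lineage, so the edits are not guaranteed to be visible.
    static void unsafeIncrement(List<int[]> boxes) {
        boxes.forEach(b -> b[0]++);
    }

    // Safe pattern: map each element to a new value, yielding a new dataset,
    // which is what Spark transformations do.
    static List<Integer> safeIncrement(List<Integer> values) {
        return values.stream().map(v -> v + 1).collect(Collectors.toList());
    }
}
```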
> 
> On Fri, Dec 5, 2014 at 2:40 PM, Ron Ayoub <ronaldayoub@live.com> wrote:
> > I tricked myself into thinking it was uniting things correctly. I see I'm
> > wrong now.
> >
> > I have a question regarding your comment that RDDs are immutable. Can you
> > change values in an RDD using forEach? Does that violate immutability? I've
> > been using forEach to modify RDDs, but perhaps I've tricked myself once
> > again into believing it is working. I hold object references, so perhaps it
> > is working serendipitously in local mode: the references are in fact not
> > changing, but the referents are, and somehow this will no longer work
> > on a cluster.
> >
> > Thanks for comments.
> >
> >> From: sowen@cloudera.com
> >> Date: Fri, 5 Dec 2014 14:22:38 -0600
> >> Subject: Re: Java RDD Union
> >> To: ronaldayoub@live.com
> >> CC: user@spark.apache.org
> >
> >>
> >> No, RDDs are immutable. union() creates a new RDD, and does not modify
> >> an existing RDD. Maybe this obviates the question. I'm not sure what
> >> you mean about releasing from memory. If you want to repartition the
> >> unioned RDD, you repartition the result of union(), not anything else.
> >>
> >> On Fri, Dec 5, 2014 at 1:27 PM, Ron Ayoub <ronaldayoub@live.com> wrote:
> >> > I'm a bit confused regarding the expected behavior of unions. I'm
> >> > running on 8 cores. I have an RDD that is used to collect cluster
> >> > associations (cluster id, content id, distance) for internal clusters as
> >> > well as leaf clusters, since I'm doing hierarchical k-means and need all
> >> > distances for sorting documents appropriately upon examination.
> >> >
> >> > It appears that union simply adds the items in the argument to the RDD
> >> > instance the method is called on, rather than just returning a new RDD.
> >> > If I want to use union this way, as more of an add/append, should I be
> >> > capturing the return value and releasing the original from memory? I
> >> > need help clarifying the semantics here.
> >> >
> >> > Also, in another related thread someone mentioned coalesce after union.
> >> > Would I need to do the same on the instance RDD I'm calling union on?
> >> >
> >> > Perhaps a method such as append would be useful and clearer.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> >> For additional commands, e-mail: user-help@spark.apache.org
> >>
> 
> 
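Sean's description of union() earlier in the thread can be illustrated without Spark: the operation returns a new dataset and leaves both inputs untouched, so there is nothing to "release"; you simply use (and, if desired, repartition or cache) the returned value. A minimal sketch with plain lists standing in for RDDs:

```java
import java.util.ArrayList;
import java.util.List;

public class UnionSketch {
    // Spark's union() returns a new RDD covering both inputs and modifies
    // neither; the analogue with plain lists:
    static List<String> union(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }
}
```

In Spark itself this is `JavaRDD<String> c = a.union(b);` — a and b are unchanged, and c is only a lineage node until an action forces evaluation.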