spark-user mailing list archives

From Ron Ayoub <>
Subject Modifying an RDD in forEach
Date Sat, 06 Dec 2014 12:42:00 GMT
This is from a separate thread with a differently named title. 
Why can't you modify the actual contents of an RDD using forEach? It appears to be working
for me. What I'm doing is changing the cluster assignment and distance per data item on each
iteration of the clustering algorithm. The clustering algorithm is massive and iterates thousands
of times. As I understand it now, you are supposed to create a new RDD on each pass. This is
a hierarchical k-means that I'm doing, so it consists of many small iterations rather than a
few large ones.
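For illustration, here is a sketch of what one assignment pass looks like written functionally
(producing a new collection) instead of mutating in place. Note the assumption: this uses a plain
Scala Seq as a stand-in for an RDD so it runs anywhere; the real code would use Spark's RDD API,
where map likewise yields a new RDD rather than modifying the old one.

```scala
// Hypothetical stand-in for one k-means assignment pass.
case class Point(x: Double, y: Double)

def dist2(a: Point, b: Point): Double = {
  val dx = a.x - b.x
  val dy = a.y - b.y
  dx * dx + dy * dy
}

// Each pass returns a NEW (clusterIndex, squaredDistance) collection
// instead of mutating per-item fields, mirroring how each Spark
// transformation yields a new RDD.
def assign(points: Seq[Point], centers: Seq[Point]): Seq[(Int, Double)] =
  points.map { p =>
    val (d, i) = centers.map(c => dist2(p, c)).zipWithIndex.min
    (i, d)
  }
```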
I understand the restriction that operations used in aggregating, reducing, etc. need to
be associative. However, forEach operates on a single item. So given that Spark is advertised
as being great for iterative algorithms since it operates in-memory, how can it be good to
create thousands upon thousands of RDDs over the course of an iterative algorithm? Does
Spark have some kind of trick, like reuse behind the scenes - fully persistent data structures
or whatever? How can it possibly be efficient for 'iterative' algorithms when it is creating
so many RDDs as opposed to one?
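To make the concern concrete, here is a sketch of the loop shape in question: a var rebound
to a freshly derived collection each pass, so the old value simply becomes garbage. Again a
plain Seq stands in for an RDD (an assumption for runnability); in actual Spark one would also
persist() the new RDD and unpersist() the old one so only one materialized copy is live at a time.

```scala
// Hypothetical iterative driver loop: rebind, don't mutate.
def step(xs: Seq[Double]): Seq[Double] = xs.map(_ / 2.0)

def iterate(init: Seq[Double], n: Int): Seq[Double] = {
  var current = init
  for (_ <- 0 until n) {
    // Each pass derives a new collection; the previous one is
    // no longer referenced and can be reclaimed.
    current = step(current)
  }
  current
}
```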
Or is the answer that I should keep doing what I'm doing, because it works, even though
it is not theoretically sound or aligned with functional ideas? I personally just want it
to be fast and able to operate on up to 500 million data items.