spark-user mailing list archives

From Ron Ayoub <>
Subject RDD lineage and broadcast variables
Date Fri, 12 Dec 2014 16:52:09 GMT
I'm still wrapping my head around the fact that the data backing an RDD is immutable, since
an RDD may need to be reconstructed from its lineage at any point. In the context of clustering
there are many iterations where an RDD may need to change (for instance cluster assignments,
etc.) based on a broadcast variable holding a list of centroids, which are objects that in turn
contain a list of features. So immutability is all well and good for the purposes of being able
to replay a lineage. But now I'm wondering: during each iteration, in which this RDD goes through
many transformations, it will be transformed based on that broadcast variable of centroids,
and those centroids are mutable. How would it replay the lineage in this instance? Does a
dependency on mutable variables mess up the whole lineage thing?
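
To make the concern concrete, here is a minimal plain-Python sketch. It is not Spark code
and uses no Spark APIs; `assign`, `snapshot`, and `shared` are hypothetical names, and the
closures only stand in for a recorded transformation that might be replayed later. It
compares a step that captures an immutable snapshot of the centroids with one that captures
a list that is later updated in place.

```python
# Plain-Python illustration only; nothing here is Spark's API.

def assign(points, centroids):
    """Return, for each 1-D point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            for p in points]

points = (1.0, 2.0, 9.0, 10.0)

# Case 1: the step captures an immutable snapshot of the centroids.
snapshot = (0.0, 5.0)
step_frozen = lambda: assign(points, snapshot)
first_frozen = step_frozen()

# Case 2: the step captures a mutable list that is updated in place afterwards.
shared = [0.0, 5.0]
step_mutable = lambda: assign(points, shared)
first_mutable = step_mutable()
shared[1] = 1.5  # centroids mutated for the "next iteration"

# "Replaying" each step after the update:
replay_frozen = step_frozen()    # equal to first_frozen
replay_mutable = step_mutable()  # no longer equal to first_mutable
```

In this sketch the replay is only repeatable when each iteration's centroids are a fresh,
unmodified value rather than one object changed in place, which is the situation I am
asking about.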
Any help appreciated. Just trying to wrap my head around using Spark correctly. I will say
it does seem like there is a common misconception that Spark RDDs are in-memory arrays,
but perhaps this is for a reason. Perhaps in some cases an option for mutability, with an
exception on failure, is exactly what is needed for a one-off algorithm that doesn't
necessarily need resiliency. Just a thought.