spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yadid Ayzenberg <ya...@media.mit.edu>
Subject Re: JavaPairRDD mapPartitions side effects
Date Mon, 09 Dec 2013 19:25:53 GMT
yes, I understand that I lost the original reference to my RDD.
my questions is regarding the new reference (made by tempRDD). I should 
have 2 references to two different points in the lineage chain.
One is myRDD and the other is tempRDD. it seems that after running the 
transformation referred to by tempRDD, I have altered myRDD as well.
(or at least that's what I think is happening). When I try to run an 
additional transformation, it seems that that myRDD is now pointing to a 
point in the lineage which is after the 2nd transformation.

Yadid


On 12/9/13 2:17 PM, Mark Hamstra wrote:
> Neither map nor mapPartitions mutates an RDD -- if you establish an 
> immutable reference to an rdd (e.g., in Scala,  val rdd = ...), the 
> data contained within that RDD will be the same regardless of any map 
> or mapPartition transformations.  However, if you re-assign the 
> reference to point to the transformed RDD (as you do with myRDD = 
> myRDD.mapPartitions(...)), then you've lost the reference to the 
> original (un-mutated) state of the RDD and have only a reference to 
> the RDD-with-tranformation-applied.  That doesn't make the RDD mutable 
> nor does it make either map or mapPartitions a side-effecting mutator 
> -- you've just changed where in a lineage of transformations you are 
> pointing to with your mutable myRDD reference.
>
>
> On Mon, Dec 9, 2013 at 11:06 AM, Yadid Ayzenberg <yadid@media.mit.edu 
> <mailto:yadid@media.mit.edu>> wrote:
>
>
>     Hi all,
>
>     Im noticing some strange behavior when running mapPartitions.
>     Pseudo code:
>
>     JavaPairRDD<Object, Tuple2<Object, BSONObject>> myRDD =
>     myRDD.mapPartitions( func )
>
>     myRDD.count()
>
>     ArrayList<Tuple2<Integer, Tuple2<List<Tuple2<Double, Double>>,
>     List<Tuple2<Double, Double>>>>>tempRDD = myRDD.mapPartitions(func2
)
>
>     tempRDD.count()
>
>
>     JavaPairRDD<Object, Tuple2<Object, BSONObject>> myRDD =
>     myRDD.mapPartitions( func )
>
>
>     It seems that mapPartitions has side-effects. When I try running
>     the last line - its seems that contents of myRDD have been changed
>     by the previous map. I thought the RDD were immutable and that It
>     was only possible to generate new RDDs using map. Is this incorrect?
>
>
>     Thanks,
>     Yadid
>
>


Mime
View raw message