spark-user mailing list archives

From Yadid Ayzenberg <ya...@media.mit.edu>
Subject Re: JavaPairRDD mapPartitions side effects
Date Mon, 09 Dec 2013 19:40:49 GMT
I guess that's the answer then: I am mutating my cached objects.
I will probably need to create a new copy of those objects within the
transformation to avoid this.
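
A minimal sketch of the copy-before-mutate idea, in plain Java rather than
against the Spark API. Everything named here is hypothetical: the class and
method names are made up for illustration, a Map stands in for BSONObject,
and an in-memory List stands in for a cached partition. The copy is shallow,
so records holding nested mutable values would presumably need a deep copy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Spark code.
final class DefensiveCopySketch {

    // Stand-in for a cached partition: three mutable records with values 1, 2, 3.
    static List<Map<String, Object>> cachedPartition() {
        List<Map<String, Object>> cached = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            Map<String, Object> record = new HashMap<>();
            record.put("value", i);
            cached.add(record);
        }
        return cached;
    }

    // Stand-in for a func2 that doubles each value IN PLACE.
    // The objects it returns are the same objects held by the cache.
    static List<Map<String, Object>> mutatingFunc(Iterator<Map<String, Object>> partition) {
        List<Map<String, Object>> out = new ArrayList<>();
        while (partition.hasNext()) {
            Map<String, Object> record = partition.next();
            record.put("value", ((Integer) record.get("value")) * 2);
            out.add(record);
        }
        return out;
    }

    // Same logic, but each record is (shallow) copied before it is changed,
    // so the cached objects are left as they were.
    static List<Map<String, Object>> copyingFunc(Iterator<Map<String, Object>> partition) {
        List<Map<String, Object>> out = new ArrayList<>();
        while (partition.hasNext()) {
            Map<String, Object> copy = new HashMap<>(partition.next());
            copy.put("value", ((Integer) copy.get("value")) * 2);
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> cached = cachedPartition();

        copyingFunc(cached.iterator());
        System.out.println("cache after copyingFunc:  " + cached);

        mutatingFunc(cached.iterator());
        System.out.println("cache after mutatingFunc: " + cached);
    }
}
```

With the copying version, a later pass over the cached data should see the
original values; with the mutating version it sees the doubled ones, which
matches the behavior I was seeing.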


On 12/9/13 2:33 PM, Josh Rosen wrote:
> Spark doesn't perform defensive copying before passing cached objects 
> to your user-defined functions, so if you're caching mutable Java 
> objects and mutating them in a transformation, the effects of that 
> mutation might change the cached data and affect other RDDs.  Is func2 
> mutating your cached objects?
>
>
> On Mon, Dec 9, 2013 at 11:25 AM, Yadid Ayzenberg
> <yadid@media.mit.edu> wrote:
>
>     Yes, I understand that I lost the original reference to my RDD.
>     My question is about the new reference (made by tempRDD). I
>     should have two references to two different points in the lineage
>     chain: one is myRDD and the other is tempRDD. It seems that after
>     running the transformation referred to by tempRDD, I have altered
>     myRDD as well (or at least that's what I think is happening).
>     When I try to run an additional transformation, it seems that
>     myRDD is now pointing to a point in the lineage which is after
>     the 2nd transformation.
>
>     Yadid
>
>
>
>     On 12/9/13 2:17 PM, Mark Hamstra wrote:
>>     Neither map nor mapPartitions mutates an RDD -- if you establish
>>     an immutable reference to an rdd (e.g., in Scala,  val rdd =
>>     ...), the data contained within that RDD will be the same
>>     regardless of any map or mapPartition transformations.  However,
>>     if you re-assign the reference to point to the transformed RDD
>>     (as you do with myRDD = myRDD.mapPartitions(...)), then you've
>>     lost the reference to the original (un-mutated) state of the RDD
>>     and have only a reference to the RDD-with-transformation-applied.
>>      That doesn't make the RDD mutable nor does it make either map or
>>     mapPartitions a side-effecting mutator -- you've just changed
>>     where in a lineage of transformations you are pointing to with
>>     your mutable myRDD reference.
>>
>>
>>     On Mon, Dec 9, 2013 at 11:06 AM, Yadid Ayzenberg
>>     <yadid@media.mit.edu> wrote:
>>
>>
>>         Hi all,
>>
>>         I'm noticing some strange behavior when running mapPartitions.
>>         Pseudo code:
>>
>>         JavaPairRDD<Object, Tuple2<Object, BSONObject>> myRDD =
>>         myRDD.mapPartitions( func )
>>
>>         myRDD.count()
>>
>>         ArrayList<Tuple2<Integer, Tuple2<List<Tuple2<Double,
>>         Double>>, List<Tuple2<Double, Double>>>>> tempRDD =
>>         myRDD.mapPartitions( func2 )
>>
>>         tempRDD.count()
>>
>>
>>         JavaPairRDD<Object, Tuple2<Object, BSONObject>> myRDD =
>>         myRDD.mapPartitions( func )
>>
>>
>>         It seems that mapPartitions has side effects. When I try
>>         running the last line, it seems that the contents of myRDD
>>         have been changed by the previous map. I thought RDDs were
>>         immutable and that it was only possible to generate new RDDs
>>         using map. Is this incorrect?
>>
>>
>>         Thanks,
>>         Yadid
>>
>>
>
>

