spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Rosen <rosenvi...@gmail.com>
Subject Re: JavaPairRDD mapPartitions side effects
Date Mon, 09 Dec 2013 19:33:14 GMT
Spark doesn't perform defensive copying before passing cached objects to
your user-defined functions, so if you're caching mutable Java objects and
mutating them in a transformation, the effects of that mutation might
change the cached data and affect other RDDs.  Is func2 mutating your
cached objects?


On Mon, Dec 9, 2013 at 11:25 AM, Yadid Ayzenberg <yadid@media.mit.edu>wrote:

>  yes, I understand that I lost the original reference to my RDD.
> my questions is regarding the new reference (made by tempRDD). I should
> have 2 references to two different points in the lineage chain.
> One is myRDD and the other is tempRDD. it seems that after running the
> transformation referred to by tempRDD, I have altered myRDD as well.
> (or at least that's what I think is happening). When I try to run an
> additional transformation, it seems that that myRDD is now pointing to a
> point in the lineage which is after the 2nd transformation.
>
> Yadid
>
>
>
> On 12/9/13 2:17 PM, Mark Hamstra wrote:
>
> Neither map nor mapPartitions mutates an RDD -- if you establish an
> immutable reference to an rdd (e.g., in Scala,  val rdd = ...), the data
> contained within that RDD will be the same regardless of any map or
> mapPartition transformations.  However, if you re-assign the reference to
> point to the transformed RDD (as you do with myRDD =
> myRDD.mapPartitions(...)), then you've lost the reference to the original
> (un-mutated) state of the RDD and have only a reference to the
> RDD-with-tranformation-applied.  That doesn't make the RDD mutable nor does
> it make either map or mapPartitions a side-effecting mutator -- you've just
> changed where in a lineage of transformations you are pointing to with your
> mutable myRDD reference.
>
>
> On Mon, Dec 9, 2013 at 11:06 AM, Yadid Ayzenberg <yadid@media.mit.edu>wrote:
>
>>
>> Hi all,
>>
>> Im noticing some strange behavior when running mapPartitions. Pseudo code:
>>
>> JavaPairRDD<Object, Tuple2<Object, BSONObject>> myRDD =
>> myRDD.mapPartitions( func )
>>
>> myRDD.count()
>>
>> ArrayList<Tuple2<Integer, Tuple2<List<Tuple2<Double, Double>>,
>> List<Tuple2<Double, Double>>>>>tempRDD = myRDD.mapPartitions(func2
)
>>
>> tempRDD.count()
>>
>>
>> JavaPairRDD<Object, Tuple2<Object, BSONObject>> myRDD =
>> myRDD.mapPartitions( func )
>>
>>
>> It seems that mapPartitions has side-effects. When I try running the last
>> line - its seems that contents of myRDD have been changed by the previous
>> map. I thought the RDD were immutable and that It was only possible to
>> generate new RDDs using map. Is this incorrect?
>>
>>
>> Thanks,
>> Yadid
>>
>>
>
>

Mime
View raw message