spark-user mailing list archives

From Adrian Tanase <>
Subject Re: RDD of ImmutableList
Date Mon, 05 Oct 2015 20:11:30 GMT
If you don't need to write data back using that library, I'd say go for #2. Convert to a Scala
class and standard lists; that should be easier down the line. That being said, you may end up
writing custom code anyway if you stick with Kryo...


On 05 Oct 2015, at 21:42, Jakub Dubovsky <<>>

Thank you for the quick reaction.

I have to say this is very surprising to me. I have never been advised to stop using an immutable
approach. The whole RDD abstraction is designed to be immutable (which is somewhat sabotaged by not
being able to (de)serialize immutable classes properly). I will ask on the dev list whether this is
to be changed or not.

OK, I have let go of my initial feelings, so now let's be pragmatic. This question is still for
everyone, not just Igor:

I use a class from a library, and that class is immutable. I want to use this class to represent
my data in an RDD because this saves me a huge amount of work. The class uses ImmutableList as
one of its fields, and that is why it fails. But isn't there a way to work around this? I ask
because I have exactly zero knowledge about Kryo and how it works. So, for example,
would either of these two work?

1) Change the external class so that it implements the writeObject and readObject methods (it's Java).
Will these methods be used by Kryo? (I can ask the maintainers of the library to change the
class if the change is reasonable. Adding these methods would be reasonable, while dropping
immutability certainly wouldn't be.)

2) Wrap the class in a Scala class that would translate the data during (de)serialization?
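
A minimal sketch of how option 2 might look, assuming the Guava jar is on the classpath. `LibraryRecord` is a hypothetical stand-in for the external immutable class, and `RecordData`, `fromLibrary` and `toLibrary` are illustrative names, not part of any library. The idea is that only the plain Scala mirror, built from standard collections, is ever stored in the RDD.

```scala
import com.google.common.collect.ImmutableList
import scala.collection.JavaConverters._

// Hypothetical stand-in for the external library's immutable class.
final class LibraryRecord(val name: String, val values: ImmutableList[Integer])

// Plain Scala mirror that holds the same data in standard collections only.
case class RecordData(name: String, values: List[Int])

object RecordData {
  // Translate the library object into the mirror before it enters an RDD.
  def fromLibrary(r: LibraryRecord): RecordData =
    RecordData(r.name, r.values.asScala.map(_.intValue).toList)

  // Rebuild the library object when it is needed again.
  def toLibrary(d: RecordData): LibraryRecord = {
    val boxed: java.util.List[Integer] = d.values.map(Int.box).asJava
    new LibraryRecord(d.name, ImmutableList.copyOf(boxed))
  }
}
```

With something like this, the job would parallelize `records.map(RecordData.fromLibrary)` rather than the library objects themselves, and call `RecordData.toLibrary` only where the original class is required.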

  Jakub Dubovsky

---------- Original message ----------
From: Igor Berman <<>>
To: Jakub Dubovsky <<>>
Date: 5. 10. 2015 20:11:35
Subject: Re: RDD of ImmutableList

Kryo doesn't support Guava's collections by default.
I remember coming across a project on GitHub that fixes this (not sure, though). I ended up no
longer using Guava collections wherever Spark RDDs are concerned.
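
For anyone who wants to keep ImmutableList in the RDD, a custom Kryo serializer is one possible route. The sketch below is only an illustration under assumptions: it targets the Kryo 2.x to 4.x `Serializer` API that ships with Spark (the `read` signature differs in Kryo 5), and `ImmutableListKryoSerializer` and `GuavaRegistrator` are made-up names. It writes the elements one by one and rebuilds the list with `ImmutableList.copyOf`, so nothing ever calls `add` on an immutable instance.

```scala
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import com.google.common.collect.ImmutableList
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical serializer: stores the size, then each element with its class.
class ImmutableListKryoSerializer extends Serializer[ImmutableList[AnyRef]] {

  override def write(kryo: Kryo, output: Output, list: ImmutableList[AnyRef]): Unit = {
    output.writeInt(list.size, true)
    val it = list.iterator()
    while (it.hasNext) kryo.writeClassAndObject(output, it.next())
  }

  override def read(kryo: Kryo, input: Input,
                    tpe: Class[ImmutableList[AnyRef]]): ImmutableList[AnyRef] = {
    val size = input.readInt(true)
    // Collect into a mutable buffer first, then copy into an immutable list.
    val buffer = new java.util.ArrayList[AnyRef](size)
    var i = 0
    while (i < size) {
      buffer.add(kryo.readClassAndObject(input))
      i += 1
    }
    ImmutableList.copyOf(buffer)
  }
}

// Hypothetical registrator: a default serializer applies to every concrete
// ImmutableList subclass, which matters because those classes are not public.
class GuavaRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit =
    kryo.addDefaultSerializer(classOf[ImmutableList[_]], new ImmutableListKryoSerializer)
}
```

To try it, the Spark configuration would need `spark.serializer` set to `org.apache.spark.serializer.KryoSerializer` and `spark.kryo.registrator` set to the fully qualified name of the registrator class. This has not been checked against every Spark and Guava combination, so treat it as a starting point.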

On 5 October 2015 at 21:04, Jakub Dubovsky <<>>
Hi all,

  I would like some advice on how to use ImmutableList with an RDD. Here is a small demonstration
of the essence of my problem in spark-shell with the Guava jar added:

scala> import

scala> val arr = Array(ImmutableList.of(1,2), ImmutableList.of(2,4), ImmutableList.of(3,6))
arr: Array[[Int]] = Array([1, 2], [2, 4], [3, 6])

scala> val rdd = sc.parallelize(arr)
rdd: org.apache.spark.rdd.RDD[[Int]] = ParallelCollectionRDD[0]
at parallelize at <console>:24

scala> rdd.count

 This results in a Kryo exception saying that it cannot add a new element to the list instance
during deserialization: java.lang.UnsupportedOperationException
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1163)
        at org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
Caused by: java.lang.UnsupportedOperationException

  It somehow makes sense. But I cannot think of a workaround, and I do not believe that using
ImmutableList with an RDD is impossible. How is this solved?

  Thank you in advance!

   Jakub Dubovsky
