spark-issues mailing list archives

From "Matei Zaharia (JIRA)" <>
Subject [jira] [Updated] (SPARK-1296) Make RDDs Covariant
Date Fri, 04 Apr 2014 00:42:14 GMT


Matei Zaharia updated SPARK-1296:

    Fix Version/s:     (was: 1.0.0)

> Make RDDs Covariant
> -------------------
>                 Key: SPARK-1296
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Michael Armbrust
>            Assignee: Michael Armbrust
> h2. First, what is the problem with RDDs not being covariant?
> {code}
> // Consider a function that takes a Seq of some trait.
> scala> trait A { val a = 1 }
> scala> def f(as: Seq[A]) = as.map(_.a)
> // A list of a concrete version of that trait can be used in this function.
> scala> class B extends A
> scala> f(new B :: Nil)
> res0: Seq[Int] = List(1)
> // Now let's try the same thing with RDDs.
> scala> def f(as: org.apache.spark.rdd.RDD[A]) = as.map(_.a)
> scala> val rdd = sc.parallelize(new B :: Nil)
> rdd: org.apache.spark.rdd.RDD[B] = ParallelCollectionRDD[2] at parallelize at <console>:42
> // :(
> scala> f(rdd)
> <console>:45: error: type mismatch;
>  found   : org.apache.spark.rdd.RDD[B]
>  required: org.apache.spark.rdd.RDD[A]
> Note: B <: A, but class RDD is invariant in type T.
> You may wish to define T as +T instead. (SLS 4.5)
>               f(rdd)
> {code}
> h2. Is it possible to make RDDs covariant?
> Probably?  In terms of the public user interface, they are *mostly* covariant. (Internally
> we use the type parameter T in a lot of mutable state that breaks the covariance contract,
> but I think with casting we can 'promise' the compiler that we are behaving.) There are
> also complications with other invariant types that we return.
> h2. What will it take to make RDDs covariant?
> As I mention above, all of our mutable internal state is going to require casting to
> avoid using T.  This seems to be okay; it makes our life only slightly harder. This extra
> work is required because we are basically promising the compiler that even if an RDD is
> implicitly upcast, internally we are keeping all the checkpointed data of the correct type.
> Since an RDD is immutable, we are okay!
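To illustrate the kind of cast-based promise described above, here is a minimal sketch (a hypothetical class, not Spark code) of a covariant container that keeps mutable internal state without mentioning T:

```scala
// Hypothetical sketch: a covariant container whose mutable cache is
// stored as Any and cast back to T on the way out.
class Box[+T](initial: T) {
  // A covariant T cannot appear in a var, so we erase it to Any...
  private var cached: Any = initial
  // ...and promise the compiler the cast is safe, because only values of
  // the original element type are ever stored. Since the value never
  // changes type after an upcast, the promise holds.
  def get: T = cached.asInstanceOf[T]
}
```

Because T is covariant, a `Box[String]` can be used where a `Box[Any]` is expected, and `get` still returns the original value.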
> We also need to modify all the places where we use T in function parameters.  For example:
> {code}
> def ++[U >: T : ClassTag](other: RDD[U]): RDD[U] = this.union(other).asInstanceOf[RDD[U]]
> {code}
> We are now allowing you to append an RDD of a less specific type, and then returning
> a less specific new RDD.  This, I would argue, is a good change: we are strictly improving
> the power of the RDD interface while maintaining reasonable type semantics.
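As a standalone sketch of the widened signature (MiniRDD is a hypothetical stand-in, not Spark code), combining a collection of a subtype with one of a supertype yields the supertype:

```scala
// Hypothetical stand-in for RDD, illustrating the widened ++ signature.
class MiniRDD[+T](val data: Seq[T]) {
  // U is a supertype of T, so the result is typed at the less specific U.
  def ++[U >: T](other: MiniRDD[U]): MiniRDD[U] =
    new MiniRDD[U](data ++ other.data)
}

trait A { val a = 1 }
class B extends A

val bs = new MiniRDD[B](Seq(new B))
val as = new MiniRDD[A](Seq(new A {}))
val combined: MiniRDD[A] = bs ++ as // MiniRDD[B] ++ MiniRDD[A] : MiniRDD[A]
```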
> h2. So, why wouldn't we do it?
> There are a lot of places where we interact with invariant types.  We return both Maps
> and Arrays from a lot of public functions.  Arrays are invariant (though if we returned
> immutable sequences instead we would be fine), and Maps are invariant in the key (once
> again, immutable sequences of tuples would be great here).
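The distinction is easy to demonstrate: even with a covariant RDD[+T], a method returning Array[T] would not widen, because Scala's Array is invariant while Seq is covariant:

```scala
trait A { val a = 1 }
class B extends A

val bSeq: Seq[B] = List(new B)
val aSeq: Seq[A] = bSeq      // compiles: Seq is declared Seq[+A]

val bArr: Array[B] = Array(new B)
// val aArr: Array[A] = bArr // does not compile: Array[T] is invariant in T
```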
> I don't think this is a deal breaker, and we may even be able to get away with it without
> changing the return types of these functions.  For example, I think that this should work,
> though once again it requires making promises to the compiler:
> {code}
>   /**
>    * Return an array that contains all of the elements in this RDD.
>    */
>   def collect[U >: T](): Array[U] = {
>     val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
>     Array.concat(results: _*).asInstanceOf[Array[U]]
>   }
> {code}
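The same trick can be sketched outside Spark: compute with the precise element type T internally, then surface the result at any supertype U (collectAs is a hypothetical helper mirroring the proposed collect):

```scala
import scala.reflect.ClassTag

// Hypothetical helper: build the array at the precise type T, then
// promise the compiler it may be viewed as Array[U] for any U >: T.
// Safe for reference types, since JVM arrays are themselves covariant.
def collectAs[T: ClassTag, U >: T](parts: Seq[Array[T]]): Array[U] =
  Array.concat(parts: _*).asInstanceOf[Array[U]]

trait A { val a = 1 }
class B extends A

val parts = Seq(Array(new B), Array(new B, new B))
val result: Array[A] = collectAs[B, A](parts)
```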
> I started working on this [here|].  Thoughts / suggestions are welcome!

This message was sent by Atlassian JIRA
