Koert's answer is very likely correct. The implicit definition that converts an RDD[(K, V)] to PairRDDFunctions requires that a ClassTag be available for K: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1124
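
Roughly, that conversion looks like the following (a simplified sketch; the exact signature in the linked source differs a bit between Spark versions):

  import org.apache.spark.rdd.{PairRDDFunctions, RDD}
  import scala.reflect.ClassTag

  // Simplified sketch of the implicit conversion. Without a ClassTag[K] in
  // implicit scope at the call site, the compiler cannot apply it, so join
  // (and the other pair RDD methods) is not found on RDD[(K, Int)].
  implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): PairRDDFunctions[K, V] =
    new PairRDDFunctions(rdd)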

To fully understand what's going on from a Scala beginner's point of view, you'll have to look up ClassTags, context bounds (the "K : ClassTag" syntax), and implicit conversions. Fortunately, you don't have to understand monads...
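
In case it helps: the "K : ClassTag" context bound is just shorthand for an extra implicit parameter list, so Koert's joinTest below is equivalent to this desugared sketch (same imports as his snippet):

  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD
  import scala.reflect.ClassTag

  // Desugared form of the context bound: the implicit ClassTag[K] parameter is
  // exactly what the pair RDD implicit conversion needs in order to apply.
  def joinTest[K](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)])(implicit kt: ClassTag[K]): RDD[(K, Int)] =
    rddA.join(rddB).map { case (k, (a, b)) => (k, a + b) }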


On Tue, Apr 1, 2014 at 2:06 PM, Koert Kuipers <koert@tresata.com> wrote:
  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD
  import scala.reflect.ClassTag

  def joinTest[K: ClassTag](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]): RDD[(K, Int)] = {
    rddA.join(rddB).map { case (k, (a, b)) => (k, a + b) }
  }


On Tue, Apr 1, 2014 at 4:55 PM, Daniel Siegmann <daniel.siegmann@velos.io> wrote:
When my tuple type includes a generic type parameter, the pair RDD functions aren't available. Take, for example, the following (a join of two RDDs, summing the values):

def joinTest(rddA: RDD[(String, Int)], rddB: RDD[(String, Int)]) : RDD[(String, Int)] = {
    rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
}


That works fine, but let's say I replace the type of the key with a generic type parameter:

def joinTest[K](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) : RDD[(K, Int)] = {
    rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
}


This latter function fails to compile with the error "value join is not a member of org.apache.spark.rdd.RDD[(K, Int)]".

The reason is probably obvious, but I don't have much Scala experience. Can anyone explain what I'm doing wrong?

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegmann@velos.io W: www.velos.io