spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gerard Maas <gerard.m...@gmail.com>
Subject Re: Using case classes as keys does not seem to work.
Date Tue, 22 Jul 2014 14:30:43 GMT
Just to narrow down the issue, it looks like the issue is in 'reduceByKey'
and derivates like 'distinct'.

groupByKey() seems to work

sc.parallelize(ps).map(x=> (x.name,1)).groupByKey().collect
res: Array[(String, Iterable[Int])] = Array((charly,ArrayBuffer(1)),
(abe,ArrayBuffer(1)), (bob,ArrayBuffer(1, 1)))



On Tue, Jul 22, 2014 at 4:20 PM, Gerard Maas <gerard.maas@gmail.com> wrote:

> Using a case class as a key doesn't seem to work properly. [Spark 1.0.0]
>
> A minimal example:
>
> case class P(name:String)
> val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
> sc.parallelize(ps).map(x=> (x,1)).reduceByKey((x,y) => x+y).collect
> [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1),
> (P(bob),1), (P(abe),1), (P(charly),1))
>
> In contrast to the expected behavior, that should be equivalent to:
> sc.parallelize(ps).map(x=> (x.name,1)).reduceByKey((x,y) => x+y).collect
> Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
>
> Any ideas why this doesn't work?
>
> -kr, Gerard.
>

Mime
View raw message