spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ameet Kini <>
Subject customized comparator in groupByKey
Date Wed, 07 May 2014 00:46:21 GMT
I'd like to override the logic of comparing keys for equality in
groupByKey. Kinda like how combineByKey allows you to pass in the combining
logic for "values", I'd like to do the same for keys.

My code looks like this:
val res = rdd.groupBy(myPartitioner)
Here, rdd is of type RDD[(MyKey, MyValue)], so res turns out to be of type
RDD[(MyKey, Seq[MyValue])]

MyKey is defined as case class MyKey(field1: Int, field2: Int)
and myPartitioner's getPartition(key: Any), here key is of type MyKey and
the partitioning logic is an expression on both field1 and field2.

I'm guessing the groupBy uses "equals" to compare like instances of MyKey.
Currently, the "equals" method of MyKey uses both field1 and field2, as
would be natural to its implementation. However, I'd like to have the
groupBy only use field1. Any pointers on how I can go about doing it?

One way is the following, but I'd like to avoid creating all those MyNewKey
val partitionedRdd = rdd.partitionBy(myPartitioner)
val mappedRdd = partitionedRdd.mapPartitions(partition => (myKey, myValue) => (new MyNewKey(myKey.field1),
val groupedRdd = mappedRdd.groupByKey()


View raw message