spark-user mailing list archives

From Akshat Aranya <aara...@gmail.com>
Subject partitioned groupBy
Date Tue, 16 Sep 2014 23:27:13 GMT
I have a use case where my RDD is set up like this:

Partition 0:
K1 -> [V1, V2]
K2 -> [V2]

Partition 1:
K3 -> [V1]
K4 -> [V3]

I want to invert this RDD, but only within each partition, so that the
operation does not require a shuffle.  It doesn't matter if the inverted
RDD ends up with non-unique keys across partitions, for example:

Partition 0:
V1 -> [K1]
V2 -> [K1, K2]

Partition 1:
V1 -> [K3]
V3 -> [K4]

Is there a way to do only a per-partition groupBy, instead of shuffling the
entire data?
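
To make the desired transformation concrete, here is a minimal sketch in plain Python of inverting a single partition's records. The function name invert_partition and the sample lists are hypothetical, used only for illustration; each partition is assumed to be an iterable of (key, list-of-values) pairs as in the layout above.

```python
from collections import defaultdict


def invert_partition(records):
    """Invert (key, [values]) pairs from ONE partition into (value, [keys]) pairs.

    Only the records passed in are looked at, so no data from other
    partitions is needed.
    """
    inverted = defaultdict(list)
    for key, values in records:
        for value in values:
            inverted[value].append(key)
    # Return an iterator of (value, [keys]) pairs.
    return iter(inverted.items())


# Hypothetical sample data mirroring the two partitions described above.
partition_0 = [("K1", ["V1", "V2"]), ("K2", ["V2"])]
partition_1 = [("K3", ["V1"]), ("K4", ["V3"])]

result_0 = dict(invert_partition(partition_0))
result_1 = dict(invert_partition(partition_1))
```

One possible way to apply such a function without a shuffle, assuming PySpark, would be rdd.mapPartitions(invert_partition), since mapPartitions calls the function once per partition with an iterator over that partition's records. Note that this sketch builds the whole inverted map for a partition in memory, so it assumes each partition fits on its executor.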
