Thanks, everyone. As Nathan suggested, I ended up collecting the distinct keys first and then assigning an ID to each key explicitly.
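
A minimal sketch of that approach, not the exact code used here. It assumes a pair RDD with String keys and a set of distinct keys small enough to collect to the driver. The names ExactKeyPartitioner and OneKeyPerPartition, the sample data, and the local[2] master are all hypothetical, chosen only for illustration:

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical partitioner: one partition per distinct key,
// driven by an explicit key -> partition id map built on the driver.
class ExactKeyPartitioner(ids: Map[String, Int]) extends Partitioner {
  override def numPartitions: Int = ids.size
  override def getPartition(key: Any): Int = ids(key.asInstanceOf[String])
}

object OneKeyPerPartition {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("one-key-per-partition").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Sample data (assumption): three distinct keys a, b, c.
    val rdd = sc.parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3, "b" -> 4, "c" -> 5))

    // Collect the distinct keys to the driver and give each one its own id.
    val ids: Map[String, Int] = rdd.keys.distinct.collect.sorted.zipWithIndex.toMap

    val partitioned = rdd.partitionBy(new ExactKeyPartitioner(ids))

    // Print the set of keys found in each partition; each set should hold one key.
    partitioned
      .mapPartitionsWithIndex { (i, it) => Iterator(i -> it.map(_._1).toSet) }
      .collect
      .foreach(println)

    sc.stop()
  }
}
```

Because the ids are assigned explicitly rather than derived from a hash, two different keys can never collide in one partition, which is what a hash-based partitioner cannot guarantee.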

Regards
Sumit Chawla


On Fri, Jun 22, 2018 at 7:29 AM, Nathan Kronenfeld <nkronenfeld@uncharted.software> wrote:
On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit <sumitkchawla@gmail.com> wrote:
Hi 

I have been trying to do this simple operation. I want all values for one key to land in the same partition, with no other key in that partition. Is this possible? I find that b and c always get mixed up in the same partition.



I think you could do something approximately like:

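     import org.apache.spark.Partitioner
     val keys = rdd.map(_.getKey).distinct.zipWithIndex  // (key, unique Long id)
     val numKeys = keys.count.toInt                       // one partition per distinct key
     rdd.map(r => (r.getKey, r)).join(keys)
        .map { case (_, (r, id)) => (id, r) }             // re-key each record by its unique id
        .partitionBy(new Partitioner() {def numPartitions = numKeys; def getPartition(key: Any) = key.asInstanceOf[Long].toInt})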

i.e., key by a unique number, count the distinct keys, and repartition by that number into exactly that many partitions. This presumes, of course, that the number of keys is less than Int.MaxValue.

Also, I haven't tested this code, so don't take it as anything more than an approximate idea, please :-)

                     -Nathan Kronenfeld