spark-dev mailing list archives

From "Chawla,Sumit " <sumitkcha...@gmail.com>
Subject Re: RepartitionByKey Behavior
Date Wed, 27 Jun 2018 04:52:06 GMT
Thanks everyone. As Nathan suggested, I ended up collecting the distinct
keys first and then assigning IDs to each key explicitly.
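
The thread does not include the final code, so the following is only a
minimal sketch of what "collect the distinct keys, then assign an ID to each
key explicitly" could look like. KeyIdPartitioner, KeyIdPartitionerDemo and
the sample records are hypothetical names and data, not taken from the
thread. The sketch is plain Scala so it runs without a cluster; in a real job
the class would presumably extend org.apache.spark.Partitioner and be passed
to partitionBy on a pair RDD, with the key list coming from something like
rdd.keys.distinct.collect().

```scala
// Sketch only: plain Scala, no Spark dependency, so it can be run as-is.
// Assumption: in Spark this class would extend org.apache.spark.Partitioner.

// Hypothetical helper: maps each distinct key to its own partition ID.
final class KeyIdPartitioner[K](distinctKeys: Seq[K]) {
  // Explicit key -> ID table; IDs run from 0 until numPartitions.
  private val idOf: Map[K, Int] = distinctKeys.distinct.zipWithIndex.toMap

  def numPartitions: Int = idOf.size

  // Same shape as Spark's getPartition(key: Any): Int.
  // Two different keys can never share an ID; an unknown key throws
  // NoSuchElementException instead of being silently mixed in.
  def getPartition(key: Any): Int = idOf(key.asInstanceOf[K])
}

object KeyIdPartitionerDemo {
  // Made-up sample data standing in for the keyed records of an RDD.
  val records: Seq[(String, Int)] =
    Seq("a" -> 1, "b" -> 2, "c" -> 3, "b" -> 4, "c" -> 5)

  // Stand-in for collecting the distinct keys on the driver.
  val partitioner = new KeyIdPartitioner[String](records.map(_._1))

  // Group records by their assigned partition ID, as partitionBy would.
  val byPartition: Map[Int, Seq[(String, Int)]] =
    records.groupBy { case (k, _) => partitioner.getPartition(k) }
}
```

Because the lookup table is built on the driver, this only seems reasonable
when the number of distinct keys is small enough to collect, and it creates
one partition per key.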

Regards
Sumit Chawla


On Fri, Jun 22, 2018 at 7:29 AM, Nathan Kronenfeld <
nkronenfeld@uncharted.software> wrote:

>>>> On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit <sumitkchawla@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I have been trying to do this simple operation.  I want to land all
>>>>> values with one key in the same partition, and not have any different
>>>>> key in the same partition.  Is this possible?  I am getting b and c
>>>>> always mixed up in the same partition.
>>>>>
> I think you could do something approximately like:
>
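>      import org.apache.spark.Partitioner
>      // Give each distinct key a unique index in [0, numKeys).
>      val keys = rdd.map(_.getKey).distinct.zipWithIndex
>      val numKeys = keys.count.toInt
>      // Re-key records by that index so each index gets its own partition.
>      rdd.map(r => (r.getKey, r)).join(keys)
>        .map { case (_, (r, idx)) => (idx, r) }
>        .partitionBy(new Partitioner {
>          def numPartitions: Int = numKeys
>          def getPartition(key: Any): Int = key.asInstanceOf[Long].toInt })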
>
> i.e., key by a unique number, count that, and repartition by key to the
> exact count.  This presumes, of course, that the number of keys is <MAXINT.
>
> Also, I haven't tested this code, so don't take it as anything more than
> an approximate idea, please :-)
>
>                      -Nathan Kronenfeld
>
