Is there a reason you need exactly a certain number of partitions?

One way we could make that work is for RangePartitioner to return a bunch
of empty partitions when the number of distinct elements is small. That
would require changing Spark.

If you want a quick workaround, you can also append a random value to
your key before running range partitioning, and then remove those random
values after range partitioning.
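
To illustrate the workaround, here is a rough, Spark-free sketch in plain
Python. The helpers `range_bounds`, `partition_of` and `salted_partitioning`
are hypothetical names invented for this example, and the bound selection is
only a simplified stand-in for what RangePartitioner does (it is not Spark's
actual sampling logic). In Spark itself the same idea would presumably be a
map to a (key, random) pair before sortByKey and a map back to the original
key afterwards.

```python
import bisect
import random


def range_bounds(sorted_keys, num_partitions):
    """Pick up to num_partitions - 1 strictly increasing upper bounds.

    Simplified model of a range partitioner: duplicate bounds are dropped,
    so a small number of distinct keys yields fewer partitions than asked.
    """
    n = len(sorted_keys)
    bounds = []
    for i in range(1, num_partitions):
        candidate = sorted_keys[i * n // num_partitions]
        if not bounds or candidate > bounds[-1]:
            bounds.append(candidate)
    return bounds


def partition_of(key, bounds):
    """Index of the partition whose range contains key."""
    return bisect.bisect_left(bounds, key)


def salted_partitioning(records, num_partitions, seed=0):
    """Range-partition (key, value) records on (key, random salt).

    The salt is dropped again when records are placed in their partition.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    salted = [((k, rng.randrange(1 << 30)), v) for k, v in records]
    bounds = range_bounds(sorted(sk for sk, _ in salted), num_partitions)
    parts = [[] for _ in range(len(bounds) + 1)]
    for (k, salt), v in salted:
        parts[partition_of((k, salt), bounds)].append((k, v))  # strip salt
    return parts


# Assumed toy data: 200 distinct keys, 100 records each, 800 partitions.
records = [(k, i) for k in range(200) for i in range(100)]
plain_bounds = range_bounds(sorted(k for k, _ in records), 800)
parts = salted_partitioning(records, 800)
print(len(plain_bounds) + 1, len(parts))
```

In this toy model the unsalted keys cannot produce more than 201 partitions,
while the salted keys should fill all 800. Records with the same original key
still end up in adjacent partitions, because the salt only breaks ties within
a key and never reorders different keys.
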
On Wed, Jul 22, 2015 at 2:37 AM, Sergio Ramírez wrote:
>
> Hi all:
>
> I am developing an algorithm that needs to keep elements with the same
> key together as much as possible, while always using a fixed number of
> partitions. To do that, the algorithm sorts the elements by key. The
> problem is that the number of distinct keys influences the number of
> final partitions. For example, if I define 200 distinct keys and 800
> partitions in the *sortByKey* function, the resulting number of
> partitions is equal to 202.
>
> I have taken a look at the code and found this:
>
> Note that the actual number of partitions created by the RangePartitioner
> might not be the same
> as the `partitions` parameter, in the case where the number of sampled
> records is less than the value of `partitions`.
>
> I have also tried *repartition* with *RangePartitioner*, with the same
> result (as expected).
>
> Is there any function that can solve my problem, like
> *repartitionAndSortWithinPartitions*? Is there any sequence of
> instructions that can help me? If not, I think this can become a real
> problem when sorting data in which the number of rows is huge and the
> number of distinct keys is small.
>
> Thanks in advance,
>
> Sergio R.