Any reason why you need exactly a certain number of partitions?

One way we can make that work is for RangePartitioner to return a bunch of empty partitions if the number of distinct elements is small. That would require changing Spark.

If you want a quick workaround, you can also append some random value to your key before running range partitioning, and then strip those random values out after range partitioning.
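A minimal sketch of that salting trick, using plain Python lists in place of an RDD so it runs without a cluster (in Spark the same map -> sortByKey -> map would be applied to the RDD; all names here are illustrative assumptions, not Spark API):

```python
import random

# Toy stand-in for an RDD of (key, value) pairs with few distinct keys.
records = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("c", 5)]
num_partitions = 8  # desired fixed partition count

# 1. Append a random component to each key so the composite key space
#    has many more distinct values than the original key space.
salted = [((k, random.randrange(num_partitions)), v) for k, v in records]

# 2. Range-partition / sort by the composite (key, salt) pair.
#    The original key is the primary sort component, so records with
#    the same original key stay adjacent; the salt only breaks ties.
salted.sort(key=lambda kv: kv[0])

# 3. Strip the salt to recover the original keys.
result = [(k, v) for (k, _salt), v in salted]
```

The point of the salt is only to inflate the number of distinct sampled keys that the range partitioner sees, so it can produce the requested number of non-empty ranges; it never changes which records end up grouped together.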


On Wed, Jul 22, 2015 at 2:37 AM, Sergio Ramírez <sramirezga@ugr.es> wrote:

Hi all:

I am developing an algorithm that needs to put elements with the same key together as much as possible, while always using a fixed number of partitions. To do that, the algorithm sorts the elements by key. The problem is that the number of distinct keys influences the number of final partitions. For example, if I have 200 distinct keys and ask for 800 partitions in the sortByKey function, the resulting number of partitions is 202.

I took a look at the code and found this:

Note that the actual number of partitions created by the RangePartitioner might not be the same
as the `partitions` parameter, in the case where the number of sampled records is less than the value of `partitions`.

I have also tried repartitioning with a RangePartitioner, with the same result (unsurprisingly).

Is there any function that can solve my problem, like repartitionAndSortWithinPartitions? Is there any sequence of instructions that can help me? If not, I think it can become a real problem to sort cases in which the number of rows is huge and the number of distinct keys is small.

Thanks in advance,

Sergio R.