spark-user mailing list archives

From Chris Fregly <>
Subject Re: Equally weighted partitions in Spark
Date Sun, 04 May 2014 01:12:27 GMT
@deenar-  i like the custom partitioner strategy that you mentioned.  i
think it's very useful.

as a thought exercise, is it possible to re-partition your RDD to
distribute the long-running tasks more evenly among the short-running tasks
by ordering the keys differently?  this would play nicely with the existing

or perhaps manipulate the key's hashCode() to more-evenly-distribute the
tasks to play nicely with the existing HashPartitioner?

i don't know if either of these is beneficial, but i'm throwing them out
for the sake of conversation...
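
One way to read the key-ordering idea, as a rough sketch only: rank the elements by weight, use the rank as the key, and let a hash partitioner (key hash modulo the partition count) deal heavy and light elements out round-robin. The names rank_keys_by_weight, hash_partition and partition_weights are hypothetical, hash_partition is a plain-Python stand-in for HashPartitioner rather than Spark's own code, and the weights are assumed to be known up front.

```python
def rank_keys_by_weight(items, weight):
    """Return (rank, item) pairs, heaviest item first (rank 0)."""
    ordered = sorted(items, key=weight, reverse=True)
    return list(enumerate(ordered))


def hash_partition(key, num_partitions):
    """Stand-in for a hash partitioner: key hash modulo partition count."""
    return hash(key) % num_partitions


def partition_weights(keyed_items, weight, num_partitions):
    """Total weight that lands in each partition."""
    totals = [0] * num_partitions
    for key, item in keyed_items:
        totals[hash_partition(key, num_partitions)] += weight(item)
    return totals


# Hypothetical data: each element's weight is simply its value.
items = [9, 1, 8, 2, 7, 3, 6, 4]
keyed = rank_keys_by_weight(items, weight=lambda x: x)
totals = partition_weights(keyed, lambda x: x, num_partitions=2)
```

Because consecutive ranks go to different partitions, every partition receives a mix of heavy and light elements. The balance is only approximate: within each round, the lower-numbered partitions get the slightly heavier element.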


On Fri, May 2, 2014 at 11:10 AM, Andrew Ash <> wrote:

> Deenar,
> I haven't heard of any activity to do partitioning in that way, but it
> does seem more broadly valuable.
> On Fri, May 2, 2014 at 10:15 AM, deenar.toraskar <> wrote:
>> I have equally sized partitions now, but I want the RDD to be partitioned
>> such that the partitions are equally weighted by some attribute of each
>> RDD element (e.g. size or complexity).
>> I have been looking at the RangePartitioner code and I have come up with
>> something like
>> EquallyWeightedPartitioner(noOfPartitions, weightFunction)
>> 1) take a sum (or a sample) of the complexities of all elements and
>> calculate the average weight per partition
>> 2) take a histogram of the weights
>> 3) assign a list of partitions to each bucket
>> 4) getPartition(key: Any): Int would
>>   a) get the weight and then find the bucket
>>   b) assign a random partition from the list of partitions associated with
>> that bucket
>> Just wanted to know if someone else had come across this issue before and
>> there was a better way of doing this.
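
The four steps above could be sketched roughly as follows, in plain Python with no Spark dependency. Everything here is an assumption for illustration: the class layout, the num_buckets and seed parameters, the equal-width histogram, and in particular the weighted random choice in get_partition, which picks a partition in proportion to how much of the bucket's cumulative weight falls inside it (the original proposal only says "a random partition"). A random choice also means the same element can land in different partitions on different calls, which matters for operations that rely on a key always mapping to one partition.

```python
import bisect
import random


class EquallyWeightedPartitioner:
    """Hypothetical sketch of the proposal; not an existing Spark class."""

    def __init__(self, num_partitions, weight_fn, sample, num_buckets=4, seed=0):
        self.weight_fn = weight_fn
        self._rng = random.Random(seed)  # seeded only for reproducibility
        weights = [weight_fn(x) for x in sample]

        # Step 1: average weight each partition should carry.
        avg = (sum(weights) / num_partitions) or 1.0

        # Step 2: equal-width histogram; bounds[i] is the upper edge of bucket i.
        lo, hi = min(weights), max(weights)
        width = (hi - lo) / num_buckets or 1.0
        self.bounds = [lo + width * (i + 1) for i in range(num_buckets - 1)]
        bucket_weight = [0.0] * num_buckets
        for w in weights:
            bucket_weight[self._bucket(w)] += w

        # Step 3: lay the buckets end to end along the cumulative weight and
        # record which partitions each bucket overlaps, and by how much.
        self.bucket_partitions = []
        start_w = 0.0
        for bw in bucket_weight:
            end_w = start_w + bw
            ids, shares = [], []
            for p in range(num_partitions):
                overlap = min(end_w, (p + 1) * avg) - max(start_w, p * avg)
                if overlap > 0:
                    ids.append(p)
                    shares.append(overlap)
            if not ids:  # empty bucket: fall back to the partition at its position
                ids = [min(int(start_w // avg), num_partitions - 1)]
                shares = [1.0]
            self.bucket_partitions.append((ids, shares))
            start_w = end_w

    def _bucket(self, w):
        """Index of the histogram bucket that weight w falls into."""
        return bisect.bisect_left(self.bounds, w)

    def get_partition(self, element):
        # Step 4: find the element's bucket, then pick one of its partitions.
        ids, shares = self.bucket_partitions[self._bucket(self.weight_fn(element))]
        return self._rng.choices(ids, weights=shares)[0]


# Hypothetical sample: eight light elements and two heavy ones.
sample = [1] * 8 + [10] * 2
partitioner = EquallyWeightedPartitioner(4, lambda x: x, sample, num_buckets=2)
```

With this sample the total weight is 28, so each of the four partitions should carry about 7. The light bucket (weight 8) covers partition 0 and a small part of partition 1, and the heavy bucket (weight 20) covers the rest of partition 1 plus partitions 2 and 3.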
