spark-user mailing list archives

From Jem Tucker <jem.tuc...@gmail.com>
Subject Re: Custom Partitioner
Date Wed, 02 Sep 2015 09:15:39 GMT
Could you alter the range partitioner so that it skews the partitioning and
assigns more partitions to the more heavily weighted keys? To do this you will
have to know the weighting before you start.
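
A minimal sketch of that idea in plain Python, assuming the per-key weights are known up front. A partition function only sees the key, so spreading one heavy key over several partitions needs a salt appended to the key. The helper names (allocate_partitions, salt_key, make_partition_func) and the example weights are made up for illustration and are not part of pyspark.

```python
import random


def allocate_partitions(weights, target_partitions):
    """Give each key a contiguous block of partition ids, sized roughly in
    proportion to its weight. Every key gets at least one partition, so the
    total returned may exceed target_partitions."""
    total = float(sum(weights.values()))
    blocks = {}
    start = 0
    for key in sorted(weights):
        count = max(1, int(round(target_partitions * weights[key] / total)))
        blocks[key] = (start, count)  # (first partition id, block size)
        start += count
    return blocks, start


def salt_key(key, blocks, rng=random):
    """Attach a random salt so records of a heavy key can land in any
    partition of that key's block."""
    _, count = blocks[key]
    return (key, rng.randrange(count))


def make_partition_func(blocks):
    """Build a function mapping a salted key (key, salt) to a partition id."""
    def partition_func(salted_key):
        key, salt = salted_key
        start, count = blocks[key]
        return start + salt % count
    return partition_func


# Hypothetical weights: "A" is the heaviest key, "C" the lightest.
blocks, num_partitions = allocate_partitions({"A": 6, "B": 3, "C": 1}, 10)
partition_func = make_partition_func(blocks)

# Assumed usage with an RDD of (key, value) pairs (not executed here):
#   salted = rdd.map(lambda kv: (salt_key(kv[0], blocks), kv[1]))
#   spread = salted.partitionBy(num_partitions, partition_func)
#   ... per-partition processing ...
#   restored = spread.map(lambda kv: (kv[0][0], kv[1]))  # drop the salt
```

Dropping the salt afterwards restores the original keys, which is one way to combine the split data back together.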

On Wed, Sep 2, 2015 at 8:02 AM shahid ashraf <shahid@trialx.com> wrote:

> Yes, I can take that as an example, but my actual use case is that I need to
> resolve a data skew. When I group by key (A-Z), the resulting
> partitions are skewed like this
> (partition no., no. of keys, total elements with given key):
> << partition: [(0, 0, 0), (1, 15, 17395), (2, 0, 0), (3, 0, 0), (4, 13,
> 18196), (5, 0, 0), (6, 0, 0), (7, 0, 0), (8, 1, 1), (9, 0, 0)] and
> elements: >>
> The data is skewed toward partitions 1 and 4. I need to split those
> partitions, process the split partitions, and then be able to
> combine the split partitions back again.
>
> On Tue, Sep 1, 2015 at 10:42 PM, Davies Liu <davies@databricks.com> wrote:
>
>> You can take sortByKey as an example:
>> https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L642
>>
>> On Tue, Sep 1, 2015 at 3:48 AM, Jem Tucker <jem.tucker@gmail.com> wrote:
>> > something like...
>> >
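>> > from pyspark.rdd import Partitioner
>> >
>> > def rangePartition(key):
>> >     # Logic to turn key into a partition id goes here
>> >     partition_id = 0  # placeholder
>> >     return partition_id
>> >
>> > class RangePartitioner(Partitioner):
>> >     def __init__(self, numParts):
>> >         # Partitioner takes (numPartitions, partitionFunc)
>> >         Partitioner.__init__(self, numParts, rangePartition)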
>> >
>> > On Tue, Sep 1, 2015 at 11:38 AM shahid ashraf <shahid@trialx.com> wrote:
>> >>
>> >> Hi
>> >>
>> >> I think a range partitioner is not available in pyspark, so my question
>> >> is: if we want to create one, how should we do it?
>> >>
>> >> On Tue, Sep 1, 2015 at 3:57 PM, Jem Tucker <jem.tucker@gmail.com> wrote:
>> >>>
>> >>> Ah sorry, I misread your question. In pyspark it looks like you just
>> >>> need to instantiate the Partitioner class with numPartitions and
>> >>> partitionFunc.
>> >>>
>> >>> On Tue, Sep 1, 2015 at 11:13 AM shahid ashraf <shahid@trialx.com> wrote:
>> >>>>
>> >>>> Hi
>> >>>>
>> >>>> I did not get this, e.g. if I need to create a custom partitioner
>> >>>> like a range partitioner.
>> >>>>
>> >>>> On Tue, Sep 1, 2015 at 3:22 PM, Jem Tucker <jem.tucker@gmail.com> wrote:
>> >>>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> You just need to extend Partitioner and override the numPartitions
>> >>>>> and getPartition methods, see below:
>> >>>>>
>> >>>>> import org.apache.spark.Partitioner
>> >>>>>
>> >>>>> class MyPartitioner extends Partitioner {
>> >>>>>   // Return the number of partitions
>> >>>>>   def numPartitions: Int = ???
>> >>>>>   // Return the partition for a given key
>> >>>>>   def getPartition(key: Any): Int = ???
>> >>>>> }
>> >>>>>
>> >>>>> On Tue, Sep 1, 2015 at 10:15 AM shahid qadri <shahidashraff@icloud.com> wrote:
>> >>>>>>
>> >>>>>> Hi Sparkians
>> >>>>>>
>> >>>>>> How can we create a custom partitioner in pyspark?
>> >>>>>>
>> >>>>>>
>> ---------------------------------------------------------------------
>> >>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> >>>>>> For additional commands, e-mail: user-help@spark.apache.org
>> >>>>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> with Regards
>> >>>> Shahid Ashraf
>>
>
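
For the question in the thread about building a range partitioner in pyspark, the following is a rough sketch of a partition function based on sorted boundaries, similar in spirit to what sortByKey does. It assumes the boundaries are chosen by hand; the boundary values and the name make_range_partition_func are illustrative only and are not part of pyspark.

```python
import bisect


def make_range_partition_func(bounds):
    """Build a partition function from a sorted list of upper bounds.

    Keys <= bounds[0] go to partition 0, keys in (bounds[0], bounds[1]]
    go to partition 1, and so on. Keys above the last bound go to the
    final partition, so there are len(bounds) + 1 partitions in total.
    """
    def range_partition(key):
        return bisect.bisect_left(bounds, key)
    return range_partition


# Hypothetical boundaries for single-letter keys A-Z.
bounds = ["G", "N", "T"]
range_partition = make_range_partition_func(bounds)
num_partitions = len(bounds) + 1

# Assumed usage with an RDD of (key, value) pairs (not executed here):
#   partitioned = rdd.partitionBy(num_partitions, range_partition)
```

Uneven boundaries (for example, narrower ranges around the heavy keys) give a skewed range partitioner, but any single key still maps to exactly one partition.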
