crunch-user mailing list archives

From David Ortiz <>
Subject Re: Percentile rank
Date Tue, 07 Apr 2015 13:27:17 GMT
I lose at proofreading.  Had to completely rewrite a section of one of my
pipelines because of that issue.

On Tue, Apr 7, 2015 at 9:25 AM David Ortiz <> wrote:

> That would be the expectation.  Depending on the number of records though,
> it's possible to start getting OutOfMemoryErrors thrown by the Hadoop
> framework during the shuffle/sort phase.  Had to completely a section of
> one of my pipelines because once we ran it on production level data that
> was happening.  Depending on what else you're running on the cluster, that
> particular issue will also be very disruptive to other jobs.
> On Tue, Apr 7, 2015 at 3:27 AM André Pinto <>
> wrote:
>> Hi Josh,
>> Yes. I guess the reasoning to not have the Iterable on Sort.sort but have
>> it on the Secondary Sort was to avoid people using it on the complete data
>> set then (and it is assumed that there will never be that many records with
>> the same key, so it will be OK to iterate over those few records). Seems
>> reasonable.
>> Yes, using the Iterable on a single reducer is certainly not the best way
>> to do this, but considering that there is no (simple) access to the global
>> index I think there is really no other way. At least iterating over the
>> Iterable will not move all the data into memory right? It does lazy
>> loading, so it will just take a lot longer than doing it in parallel.
>> Thanks.
>> On Tue, Apr 7, 2015 at 4:06 AM, Josh Wills <> wrote:
>>> Hey Andre,
>>> Not sure what you mean precisely-- do you mean an option or method in
>>> the Sort API that would include the rank of each item?
>>> In general, I like to avoid assuming that one reducer can handle all of
>>> the data in a PCollection on API methods, which I think is what you're
>>> saying (i.e., just stream all of the data in sorted order to a single
>>> reducer.)
>>> J
>>> On Mon, Apr 6, 2015 at 3:19 PM, André Pinto <> wrote:
>>>> Hi Josh,
>>>> Thanks for replying.
>>>> That really sounds very hacky. I was expecting something with a little
>>>> more support from the API.
>>>> I guess we could also use sortAndApply with a randomly generated
>>>> singleton Key for the entire set of values and then use the Iterable on the
>>>> Values to obtain the sorted index. It still looks bad though...
>>>> Just out of curiosity, why isn't the Iterable approach also supported
>>>> on the simple Sort.sort? Sorry if this looks obvious to you, but I'm still
>>>> new to Crunch and Hadoop.
>>>> Thanks.
>>>> On Thu, Apr 2, 2015 at 6:36 PM, Josh Wills <> wrote:
>>>>> I can't think of a great way to do it-- knowing exactly which record
>>>>> you're processing (in any kind of order) in a distributed processing
>>>>> job is always somewhat fraught. Gun to my head, I would do it in two
>>>>> phases:
>>>>> 1) Get the name of the FileSplit for the current task-- which can be
>>>>> retrieved, although we don't make it easy. You can do it via something
>>>>> like this from inside of a map-side DoFn:
>>>>> InputSplit split = ((MapContext) getContext()).getInputSplit();
>>>>> FileSplit baseSplit = (FileSplit) ((Supplier<InputSplit>) split).get();
>>>>> Then count up the number of records inside of each FileSplit. I'm not
>>>>> sure if you should disable combine files when you do this, but it seems
>>>>> like a good idea.
>>>>> 2) Create a new DoFn that takes the output of the previous job and
>>>>> uses it to determine exactly which record in order the currently processing
>>>>> record is, based on the sorted order of the FileSplit names and an internal
>>>>> counter that gets reset to zero for each new FileSplit.
>>>>> J
>>>>> On Thu, Apr 2, 2015 at 7:39 AM, André Pinto <> wrote:
>>>>>> Hi,
>>>>>> I'm trying to calculate the percentile ranks for the values of a
>>>>>> sorted PTable (i.e. at which % rank each element is within the whole
>>>>>> set). Is there a way to do this with Crunch? Seems that we would only
>>>>>> need to have access to the global index of the record during an
>>>>>> iteration over the data set.
>>>>>> Thanks in advance,
>>>>>> André
>>>>> --
>>>>> Director of Data Science
>>>>> Cloudera <>
>>>>> Twitter: @josh_wills <>
>>> --
>>> Director of Data Science
>>> Cloudera <>
>>> Twitter: @josh_wills <>
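
For readers of the archive, the arithmetic behind the two-phase approach
Josh describes in the quoted message can be sketched roughly as follows.
This is plain Java with no Crunch or Hadoop dependency, not code from the
thread: the class name SplitRank, its method names, and the "part-0000N"
split names are all hypothetical. It assumes phase 1 has already produced a
record count per FileSplit name, that the sorted order of the split names
matches the sort order of the data, and that "percentile rank" means the
percentage of records that come strictly before the current one.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class SplitRank {

    // Phase 1 result (assumed input): record count per split name.
    // Returns the global index of the first record of each split,
    // walking the splits in sorted name order (a running prefix sum).
    static Map<String, Long> startOffsets(Map<String, Long> countsBySplit) {
        Map<String, Long> offsets = new LinkedHashMap<>();
        long running = 0L;
        for (Map.Entry<String, Long> e : new TreeMap<>(countsBySplit).entrySet()) {
            offsets.put(e.getKey(), running);
            running += e.getValue();
        }
        return offsets;
    }

    // Total number of records across all splits.
    static long totalRecords(Map<String, Long> countsBySplit) {
        long total = 0L;
        for (long count : countsBySplit.values()) {
            total += count;
        }
        return total;
    }

    // Phase 2: the per-split counter (reset to zero for each new split)
    // plus that split's start offset gives the global index.
    static long globalIndex(Map<String, Long> offsets, String split, long localIndex) {
        return offsets.get(split) + localIndex;
    }

    // Assumed definition: percentage of records strictly before this one.
    static double percentileRank(long globalIndex, long total) {
        return 100.0 * globalIndex / total;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new TreeMap<>();
        counts.put("part-00000", 3L);
        counts.put("part-00001", 5L);
        counts.put("part-00002", 2L);

        Map<String, Long> offsets = startOffsets(counts);
        long total = totalRecords(counts);

        // Third record (local index 2) of the second split.
        long idx = globalIndex(offsets, "part-00001", 2L);
        System.out.println(idx + " of " + total + " -> "
            + percentileRank(idx, total) + "%");  // prints "5 of 10 -> 50.0%"
    }
}
```

In a real pipeline the offsets map would be small enough to hand to every
task of the second job, since it holds one entry per split rather than one
per record. If a single file is divided into several splits, the split
name alone would probably not be enough to order them, and the key would
need to include something like the split's start position as well.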
