crunch-user mailing list archives

From Josh Wills
Subject Re: Percentile rank
Date Tue, 07 Apr 2015 02:06:48 GMT
Hey Andre,

Not sure what you mean precisely-- do you mean an option or method in the
Sort API that would include the rank of each item?

In general, I like to avoid API methods that assume one reducer can handle
all of the data in a PCollection, which I think is what you're suggesting
(i.e., just stream all of the data in sorted order to a single reducer).
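
For reference, the bookkeeping behind the two-phase approach quoted below can be sketched in plain Java, with no Crunch or Hadoop calls involved. This is only an illustration under stated assumptions: the class name SplitOffsets, both method names, and the "part-0000N" split names are hypothetical, and it assumes the sorted order of the split names matches the global sort order of the data. Percentile rank here means the percentage of records that come strictly before the current one, which is one convention among several.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the Crunch API.
public class SplitOffsets {

    // Phase 1 result in, starting global index of each split out.
    // Splits are visited in sorted name order (an assumption about how
    // the sorted output files are named).
    static Map<String, Long> startOffsets(Map<String, Long> recordsPerSplit) {
        Map<String, Long> offsets = new LinkedHashMap<>();
        long running = 0;
        for (Map.Entry<String, Long> e : new TreeMap<>(recordsPerSplit).entrySet()) {
            offsets.put(e.getKey(), running); // records in all earlier splits
            running += e.getValue();
        }
        return offsets;
    }

    // Phase 2: global index = split offset + counter local to the split
    // (the counter restarts at zero for each new split).
    static double percentileRank(long splitOffset, long localIndex, long total) {
        return 100.0 * (splitOffset + localIndex) / total;
    }
}
```

With counts of 3 records in "part-00000" and 2 in "part-00001", the second split starts at global index 3, so its second record (local index 1) is global index 4 of 5 and has 80% of the records before it.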


On Mon, Apr 6, 2015 at 3:19 PM, André Pinto wrote:

> Hi Josh,
> Thanks for replying.
> That really sounds very hacky. I was expecting something with a little
> more support from the API.
> I guess we could also use sortAndApply with a randomly generated singleton
> Key for the entire set of values and then use the Iterable on the Values to
> obtain the sorted index. It still looks bad though...
> Just out of curiosity, why isn't the Iterable approach also supported on
> the simple Sort.sort? Sorry if this looks obvious to you, but I'm still new
> to Crunch and Hadoop.
> Thanks.
> On Thu, Apr 2, 2015 at 6:36 PM, Josh Wills wrote:
>> I can't think of a great way to do it-- knowing exactly which record
>> you're processing (in any kind of order) in a distributed processing job is
>> always somewhat fraught. Gun to my head, I would do it in two phases:
>> 1) Get the name of the FileSplit for the current task-- which can be
>> retrieved, although we don't make it easy. You can do it via something like
>> this from inside of a map-side DoFn:
>> InputSplit split = ((MapContext) getContext()).getInputSplit();
>> FileSplit baseSplit = (FileSplit) ((Supplier<InputSplit>) split).get();
>> Then count up the number of records inside of each FileSplit. I'm not sure
>> if you should disable combine files when you do this, but it seems like a
>> good idea.
>> 2) Create a new DoFn that takes the output of the previous job and uses
>> it to determine exactly which record in order the currently processing
>> record is, based on the sorted order of the FileSplit names and an internal
>> counter that gets reset to zero for each new FileSplit.
>> J
>> On Thu, Apr 2, 2015 at 7:39 AM, André Pinto wrote:
>>> Hi,
>>> I'm trying to calculate the percentile ranks for the values of a sorted
>>> PTable (i.e. at which % rank each element is within the whole data set). Is
>>> there a way to do this with Crunch? Seems that we would only need to have
>>> access to the global index of the record during an iteration over the data
>>> set.
>>> Thanks in advance,
>>> André
>> --
>> Director of Data Science
>> Cloudera
>> Twitter: @josh_wills

Director of Data Science
Cloudera
Twitter: @josh_wills