Hey Andre,

Not sure what you mean precisely -- do you mean an option or method in the Sort API that would include the rank of each item? In general, I like to avoid API methods that assume one reducer can handle all of the data in a PCollection, which I think is what you're describing (i.e., just stream all of the data in sorted order to a single reducer).

J

On Mon, Apr 6, 2015 at 3:19 PM, André Pinto wrote:

> Hi Josh,
>
> Thanks for replying.
>
> That really sounds very hacky. I was expecting something with a little
> more support from the API.
>
> I guess we could also use sortAndApply with a randomly generated singleton
> key for the entire set of values and then use the Iterable on the values to
> obtain the sorted index. It still looks bad, though...
>
> Just out of curiosity, why isn't the Iterable approach also supported on
> the simple Sort.sort? Sorry if this looks obvious to you, but I'm still new
> to Crunch and Hadoop.
>
> Thanks.
>
> On Thu, Apr 2, 2015 at 6:36 PM, Josh Wills wrote:
>
>> I can't think of a great way to do it -- knowing exactly which record
>> you're processing (in any kind of order) in a distributed processing job is
>> always somewhat fraught. Gun to my head, I would do it in two phases:
>>
>> 1) Get the name of the FileSplit for the current task -- which can be
>> retrieved, although we don't make it easy. You can do it via something like
>> this from inside of a map-side DoFn:
>>
>> InputSplit split = ((MapContext) getContext()).getInputSplit();
>> FileSplit baseSplit = (FileSplit) ((Supplier) split).get();
>>
>> Then count up the number of records inside of each FileSplit. I'm not sure
>> if you need to disable combining files when you do this, but it seems like a
>> good idea.
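[Editor's note: a minimal sketch of the counting step described above, in plain Java rather than a real Crunch pipeline. The split names and the shape of the input are hypothetical stand-ins for what a map-side DoFn would observe via getInputSplit(); only the tallying logic is shown.]

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SplitCounts {
    // Phase 1: tally how many records each FileSplit contains.
    // In a real pipeline, each record would be tagged with its split name
    // (obtained inside the DoFn via getContext().getInputSplit()), and the
    // tallies would be emitted as the job's output.
    static Map<String, Long> countPerSplit(List<String> splitNamePerRecord) {
        Map<String, Long> counts = new TreeMap<>(); // keyed and sorted by split name
        for (String splitName : splitNamePerRecord) {
            counts.merge(splitName, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical stream of records, each tagged with the split it came from.
        List<String> splits =
            List.of("part-00000", "part-00000", "part-00001", "part-00001", "part-00001");
        System.out.println(SplitCounts.countPerSplit(splits));
    }
}
```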
>>
>> 2) Create a new DoFn that takes the output of the previous job and uses
>> it to determine exactly which record, in order, the currently processing
>> record is, based on the sorted order of the FileSplit names and an internal
>> counter that gets reset to zero for each new FileSplit.
>>
>> J
>>
>> On Thu, Apr 2, 2015 at 7:39 AM, André Pinto wrote:
>>
>>> Hi,
>>>
>>> I'm trying to calculate the percentile ranks for the values of a sorted
>>> PTable (i.e., at which % rank each element sits within the whole data set).
>>> Is there a way to do this with Crunch? It seems we would only need access
>>> to the global index of each record during an iteration over the data set.
>>>
>>> Thanks in advance,
>>> André
>>
>> --
>> Director of Data Science
>> Cloudera
>> Twitter: @josh_wills

--
Director of Data Science
Cloudera
Twitter: @josh_wills
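[Editor's note: phase 2, sketched the same way. Given the per-split counts from phase 1, each split gets a starting offset by walking the split names in sorted order; a record's global index is its split's offset plus the in-split counter, and the percentile rank André asked about follows directly. All names here are illustrative, not Crunch API.]

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class GlobalRank {
    // Phase 2: turn per-split record counts into a starting offset per split,
    // visiting splits in sorted-name order (the same order the DoFn would use).
    static Map<String, Long> offsets(Map<String, Long> countsBySplit) {
        Map<String, Long> result = new LinkedHashMap<>();
        long runningTotal = 0;
        for (Map.Entry<String, Long> e : new TreeMap<>(countsBySplit).entrySet()) {
            result.put(e.getKey(), runningTotal);
            runningTotal += e.getValue();
        }
        return result;
    }

    // Global index of the record at 0-based position localIndex within a split.
    static long globalIndex(Map<String, Long> offsets, String split, long localIndex) {
        return offsets.get(split) + localIndex;
    }

    // Percentile rank (0..100) of a record, given its global index and the
    // total number of records; assumes total > 1.
    static double percentileRank(long globalIndex, long total) {
        return 100.0 * globalIndex / (total - 1);
    }

    public static void main(String[] args) {
        Map<String, Long> counts = Map.of("part-00000", 2L, "part-00001", 3L);
        Map<String, Long> off = offsets(counts);
        // The second record (local index 1) of part-00001 is global index 3 of 5,
        // i.e. the 75th percentile.
        System.out.println(globalIndex(off, "part-00001", 1)); // 3
        System.out.println(percentileRank(3, 5));              // 75.0
    }
}
```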