From: Josh Wills
Date: Thu, 2 Apr 2015 09:36:08 -0700
Subject: Re: Percentile rank
To: user@crunch.apache.org

I can't think of a great way to do it-- knowing exactly which record you're processing (in any kind of order) in a distributed processing job is always somewhat fraught. Gun to my head, I would do it in two phases:

1) Get the name of the FileSplit for the current task-- which can be retrieved, although we don't make it easy. You can do it via something like this from inside of a map-side DoFn:

InputSplit split = ((MapContext) getContext()).getInputSplit();
FileSplit baseSplit = (FileSplit) ((Supplier<InputSplit>) split).get();

Then count up the number of records inside of each FileSplit. I'm not sure if you should disable combine files when you do this, but it seems like a good idea.

2) Create a new DoFn that takes the output of the previous job and uses it to determine the exact global position of the record currently being processed, based on the sorted order of the FileSplit names and an internal counter that gets reset to zero for each new FileSplit.
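
To make the bookkeeping in phase 2 a bit more concrete, here is a minimal plain-Java sketch with no Crunch or Hadoop dependencies. The class and method names (SplitOffsets, startOffsets, percentileRank) are hypothetical, made up for illustration. It assumes that the per-split counts from phase 1 fit in memory, that sorting the split names lexicographically matches the global sort order of the data (which I would verify for your inputs), and that "percentile rank" means the percentage of records at or before the current one.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Crunch: turns the per-split record counts
// from phase 1 into global positions for phase 2.
final class SplitOffsets {

    // Returns, for each split name, the global 0-based index of its first record.
    // ASSUMPTION: sorted split-name order equals the global order of the data.
    static Map<String, Long> startOffsets(Map<String, Long> countsBySplit) {
        Map<String, Long> offsets = new LinkedHashMap<>();
        long running = 0L;
        // TreeMap iterates its keys (the split names) in sorted order.
        for (Map.Entry<String, Long> e : new TreeMap<>(countsBySplit).entrySet()) {
            offsets.put(e.getKey(), running);
            running += e.getValue();
        }
        return offsets;
    }

    // Percentile rank of the record at localIndex (0-based, reset per split).
    // ASSUMPTION: rank is the percentage of records at or before this one.
    static double percentileRank(long splitOffset, long localIndex, long total) {
        return 100.0 * (splitOffset + localIndex + 1) / total;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("part-00001", 2L); // made-up split names and counts
        counts.put("part-00000", 3L);
        Map<String, Long> offsets = startOffsets(counts);
        long total = 5L;
        // Second record (local index 1) of the second split is the last of 5 records.
        System.out.println(percentileRank(offsets.get("part-00001"), 1L, total));
    }
}
```

Inside the second DoFn you would look up the offset for the current FileSplit once, keep the local counter as described above, and emit the rank alongside each record.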

J

On Thu, Apr 2, 2015 at 7:39 AM, André Pinto <andredasilvapinto@gmail.com> wrote:
> Hi,
>
> I'm trying to calculate the percentile ranks for the values of a sorted PTable (i.e. at which % rank each element is within the whole data set). Is there a way to do this with Crunch? Seems that we would only need to have access to the global index of the record during an iteration over the data set.
>
> Thanks in advance,
> André

--
Director of Data Science
Cloudera
Twitter: @josh_wills