From dev-return-14161-apmail-drill-dev-archive=drill.apache.org@drill.apache.org Wed Jun 10 02:10:36 2015 Return-Path: X-Original-To: apmail-drill-dev-archive@www.apache.org Delivered-To: apmail-drill-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D6F8A17ECA for ; Wed, 10 Jun 2015 02:10:36 +0000 (UTC) Received: (qmail 56366 invoked by uid 500); 10 Jun 2015 02:10:35 -0000 Delivered-To: apmail-drill-dev-archive@drill.apache.org Received: (qmail 56307 invoked by uid 500); 10 Jun 2015 02:10:35 -0000 Mailing-List: contact dev-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@drill.apache.org Delivered-To: mailing list dev@drill.apache.org Received: (qmail 56296 invoked by uid 99); 10 Jun 2015 02:10:35 -0000 Received: from Unknown (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 10 Jun 2015 02:10:35 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 24826C0518 for ; Wed, 10 Jun 2015 02:10:35 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 3.02 X-Spam-Level: *** X-Spam-Status: No, score=3.02 tagged_above=-999 required=6.31 tests=[HTML_MESSAGE=3, SPF_PASS=-0.001, T_KAM_HTML_FONT_INVALID=0.01, T_REMOTE_IMAGE=0.01, URIBL_BLOCKED=0.001] autolearn=disabled Received: from mx1-eu-west.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id L398FFHOdWqt for ; Wed, 10 Jun 2015 02:10:27 +0000 (UTC) Received: from mail-yk0-f175.google.com (mail-yk0-f175.google.com [209.85.160.175]) by mx1-eu-west.apache.org (ASF Mail Server at mx1-eu-west.apache.org) with ESMTPS id 3D5332315E for ; Wed, 10 Jun 2015 02:10:26 +0000 (UTC) Received: by ykfl8 with SMTP id l8so16725063ykf.1 for ; Tue, 09 Jun 2015 19:09:40 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:content-type; bh=cHKYk3+IprtFPPt3jfkRnnmK0r2Ly600iowiHRwMNl4=; b=KSf7L4RNTAt6u9+EP8ShBwLnh2wlq+WBHQ4QZZKJcx89yy9MKJzHrLnbApi8qWxPKo t3KN9htbrVBkJBIKQJnhOIlqJgCNO1Pnd4VGhSk41yTihfyGkXNfjFiZI14xHWNeHuXU 90OgyuwXi5SnK61LxmlnjkuvR1gLp6j/+XWqyb9DxQ02qUoajb0dkiyOgxVJ90iGPGIU Sg9huMxPahVj6p95gxGVkB+5NoUedx6R8jehbciqYR1FIKK9yoQQ6uyQFgDGbQVxnXRS klTevXqBJWFrOUkILR3vc2kdMQAyA9ngpuF4fKD2yT3aVM4KgNQ4j7/omNHL4fe2tGVp hnaA== X-Gm-Message-State: ALoCoQnzZw0SyhJaR5pbKS6ZTSGNRsRtzLG4xHWluHfwlVRid0K4RfCKmoRWl2aflIdchs1h365g MIME-Version: 1.0 X-Received: by 10.129.39.205 with SMTP id n196mr817226ywn.55.1433902180060; Tue, 09 Jun 2015 19:09:40 -0700 (PDT) Received: by 10.129.95.136 with HTTP; Tue, 9 Jun 2015 19:09:40 -0700 (PDT) In-Reply-To: References: Date: Tue, 9 Jun 2015 19:09:40 -0700 Message-ID: Subject: Re: Window function query takes too long to complete and return results From: Abdel Hakim Deneche To: "dev@drill.apache.org" Content-Type: multipart/alternative; boundary=001a114092a215ea940518205e02 --001a114092a215ea940518205e02 Content-Type: text/plain; charset=UTF-8 please open a JIRA issue. please provide the test file (compressed) or a script to generate similar data. Thanks! On Tue, Jun 9, 2015 at 6:55 PM, Khurram Faraaz wrote: > Query that uses window functions takes too long to complete and return > results. It returns close to a million records, for which it took 533.8 > seconds ~8 minutes > Input CSV file has two columns, one integer and another varchar type > column. Please let me know if this needs to be investigated and I can > report a JIRA to track this if required ? > > Size of the input CSV file > > root@centos-01 ~]# hadoop fs -ls /tmp/manyDuplicates.csv > > -rwxr-xr-x 3 root root 27889455 2015-06-10 01:26 > /tmp/manyDuplicates.csv > > {code} > > select count(*) over(partition by cast(columns[1] as varchar(25)) order by > cast(columns[0] as bigint)) from `manyDuplicates.csv`; > > ... > > 1,000,007 rows selected (533.857 seconds) > {code} > > There are five distinct values in columns[1] in the CSV file. = [FIVE > PARTITIONS] > > {code} > > 0: jdbc:drill:schema=dfs.tmp> select distinct columns[1] from > `manyDuplicates.csv`; > > *+-----------------------+* > > *| ** EXPR$0 ** |* > > *+-----------------------+* > > *| *FFFFGGGGHHHHIIIIJJJJ * |* > > *| *PPPPQQQQRRRRSSSSTTTT * |* > > *| *AAAABBBBCCCCDDDDEEEE * |* > > *| *UUUUVVVVWWWWXXXXZZZZ * |* > > *| *KKKKLLLLMMMMNNNNOOOO * |* > > *+-----------------------+* > > 5 rows selected (1.906 seconds) > {code} > > Here is the count for each of those values in columns[1] > > {code} > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from > `manyDuplicates.csv` where columns[1] = 'FFFFGGGGHHHHIIIIJJJJ'; > > *+---------+* > > *| **EXPR$0 ** |* > > *+---------+* > > *| *200484 * |* > > *+---------+* > > 1 row selected (0.961 seconds) > > {code} > > > {code} > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from > `manyDuplicates.csv` where columns[1] = 'PPPPQQQQRRRRSSSSTTTT'; > > *+---------+* > > *| **EXPR$0 ** |* > > *+---------+* > > *| *199353 * |* > > *+---------+* > > 1 row selected (0.86 seconds) > > {code} > > > {code} > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from > `manyDuplicates.csv` where columns[1] = 'AAAABBBBCCCCDDDDEEEE'; > > *+---------+* > > *| **EXPR$0 ** |* > > *+---------+* > > *| *200702 * |* > > *+---------+* > > 1 row selected (0.826 seconds) > > {code} > > > {code} > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from > `manyDuplicates.csv` where columns[1] = 'UUUUVVVVWWWWXXXXZZZZ'; > > *+---------+* > > *| **EXPR$0 ** |* > > *+---------+* > > *| *199916 * |* > > *+---------+* > > 1 row selected (0.851 seconds) > > {code} > > > {code} > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from > `manyDuplicates.csv` where columns[1] = 'KKKKLLLLMMMMNNNNOOOO'; > > *+---------+* > > *| **EXPR$0 ** |* > > *+---------+* > > *| *199552 * |* > > *+---------+* > > 1 row selected (0.827 seconds) > {code} > > Thanks, > Khurram > -- Abdelhakim Deneche Software Engineer Now Available - Free Hadoop On-Demand Training --001a114092a215ea940518205e02--