drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steven Phillips <sphill...@maprtech.com>
Subject Re: Window function query takes too long to complete and return results
Date Wed, 10 Jun 2015 04:56:48 GMT
In cases like this where you are printing millions of record in SQLLINE,
you should pipe the output to /dev/null or to a file, and measure the
performance that way. I'm guessing that most of the time in this case is
spent printing the output to the console, and thus really unrelated to
Drill performance. If piping the data to a file or /dev/null causes the
query to run much faster, than it probably isn't a real issue.

also, anytime you are investigating a performance related issue, you should
always check the profile. In this case, I suspect you might see that most
of the time is spent in the WAIT time of the SCREEN operator. That would
indicate that client side processing is slowing the query down.

On Tue, Jun 9, 2015 at 7:09 PM, Abdel Hakim Deneche <adeneche@maprtech.com>
wrote:

> please open a JIRA issue. please provide the test file (compressed) or a
> script to generate similar data.
>
> Thanks!
>
> On Tue, Jun 9, 2015 at 6:55 PM, Khurram Faraaz <kfaraaz@maprtech.com>
> wrote:
>
> > Query that uses window functions takes too long to complete and return
> > results. It returns close to a million records, for which it took 533.8
> > seconds ~8 minutes
> > Input CSV file has two columns, one integer and another varchar type
> > column. Please let me know if this needs to be investigated and I can
> > report a JIRA to track this if required ?
> >
> > Size of the input CSV file
> >
> > root@centos-01 ~]# hadoop fs -ls /tmp/manyDuplicates.csv
> >
> > -rwxr-xr-x   3 root root   27889455 2015-06-10 01:26
> > /tmp/manyDuplicates.csv
> >
> > {code}
> >
> > select count(*) over(partition by cast(columns[1] as varchar(25)) order
> by
> > cast(columns[0] as bigint)) from `manyDuplicates.csv`;
> >
> > ...
> >
> > 1,000,007 rows selected (533.857 seconds)
> > {code}
> >
> > There are five distinct values in columns[1] in the CSV file. = [FIVE
> > PARTITIONS]
> >
> > {code}
> >
> > 0: jdbc:drill:schema=dfs.tmp> select distinct columns[1] from
> > `manyDuplicates.csv`;
> >
> > *+-----------------------+*
> >
> > *| **       EXPR$0        ** |*
> >
> > *+-----------------------+*
> >
> > *| *FFFFGGGGHHHHIIIIJJJJ * |*
> >
> > *| *PPPPQQQQRRRRSSSSTTTT * |*
> >
> > *| *AAAABBBBCCCCDDDDEEEE * |*
> >
> > *| *UUUUVVVVWWWWXXXXZZZZ * |*
> >
> > *| *KKKKLLLLMMMMNNNNOOOO * |*
> >
> > *+-----------------------+*
> >
> > 5 rows selected (1.906 seconds)
> > {code}
> >
> > Here is the count for each of those values in columns[1]
> >
> > {code}
> >
> > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > `manyDuplicates.csv` where columns[1] = 'FFFFGGGGHHHHIIIIJJJJ';
> >
> > *+---------+*
> >
> > *| **EXPR$0 ** |*
> >
> > *+---------+*
> >
> > *| *200484 * |*
> >
> > *+---------+*
> >
> > 1 row selected (0.961 seconds)
> >
> > {code}
> >
> >
> > {code}
> >
> > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > `manyDuplicates.csv` where columns[1] = 'PPPPQQQQRRRRSSSSTTTT';
> >
> > *+---------+*
> >
> > *| **EXPR$0 ** |*
> >
> > *+---------+*
> >
> > *| *199353 * |*
> >
> > *+---------+*
> >
> > 1 row selected (0.86 seconds)
> >
> > {code}
> >
> >
> > {code}
> >
> > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > `manyDuplicates.csv` where columns[1] = 'AAAABBBBCCCCDDDDEEEE';
> >
> > *+---------+*
> >
> > *| **EXPR$0 ** |*
> >
> > *+---------+*
> >
> > *| *200702 * |*
> >
> > *+---------+*
> >
> > 1 row selected (0.826 seconds)
> >
> > {code}
> >
> >
> > {code}
> >
> > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > `manyDuplicates.csv` where columns[1] = 'UUUUVVVVWWWWXXXXZZZZ';
> >
> > *+---------+*
> >
> > *| **EXPR$0 ** |*
> >
> > *+---------+*
> >
> > *| *199916 * |*
> >
> > *+---------+*
> >
> > 1 row selected (0.851 seconds)
> >
> > {code}
> >
> >
> > {code}
> >
> > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> > `manyDuplicates.csv` where columns[1] = 'KKKKLLLLMMMMNNNNOOOO';
> >
> > *+---------+*
> >
> > *| **EXPR$0 ** |*
> >
> > *+---------+*
> >
> > *| *199552 * |*
> >
> > *+---------+*
> >
> > 1 row selected (0.827 seconds)
> > {code}
> >
> > Thanks,
> > Khurram
> >
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   <http://www.mapr.com/>
>
>
> Now Available - Free Hadoop On-Demand Training
> <
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> >
>



-- 
 Steven Phillips
 Software Engineer

 mapr.com

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message