drill-user mailing list archives

From David Kincaid <kincaid.d...@gmail.com>
Subject Drill query planning taking a LONG time
Date Mon, 13 Feb 2017 20:05:29 GMT
We've just recently started working with Drill, and I'm seeing something
that doesn't seem right and am not sure how to troubleshoot. We have 100
Parquet files, each about 400 MB, using Snappy compression. While querying
this data I'm seeing extraordinary planning times for certain queries:
planning takes about 12 minutes, while the actual execution of the query
takes less than 2 minutes.

A few details of our setup: we are running Drill on an AWS EMR cluster of
m4.16xlarge nodes (64 cores, 256 GB RAM each). We've given Drill an 8 GB Java
heap and 100 GB of Java direct memory. We have verified that the metadata
cache file is being created and used. We have tried a cluster of 10 nodes and
a cluster of 2 nodes with no difference in planning or execution time.
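
For reference, the way we rebuild the metadata cache is Drill's standard
REFRESH command, run against the same table path used in the query below
(adjust the path to your own table directory):

-- Rebuild the Parquet metadata cache for the table directory.
-- Drill writes a .drill.parquet_metadata file under the directory,
-- which the planner reads instead of opening every Parquet footer.
REFRESH TABLE METADATA s3.cisexport.transactionView;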

Does anyone have pointers on troubleshooting excessive planning time? It
seems like we must have something misconfigured or be missing something.
We're very new to Drill, and I think I've exhausted my troubleshooting
ideas so far. Any tips anyone can provide?
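
One thing we've been using to isolate the planning phase is EXPLAIN, since
it runs the planner without executing the query, so its wall-clock time is
essentially pure planning cost (shown here with a trivial query as a sketch):

-- EXPLAIN PLAN FOR only invokes the planner, so timing this statement
-- separates planning cost from execution cost.
EXPLAIN PLAN FOR
select count(*) from s3.cisexport.transactionView;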

Here is the main query I've been experimenting with, so you can get a feel
for the query complexity:

select fltb1.sapId, yearmo,
       COUNT(*) as totalcnt,
       count(distinct (CASE
                       THEN fltb1.invoiceId
                       END)) as ultracount,
       count(distinct (CASE
                       THEN fltb1.invoiceId
                       END)) as labcount
from (select sapId, invoiceId,
             TO_CHAR(TO_TIMESTAMP(transactionDate,
                     'YYYY-MM-dd HH:mm:ss.SSSSSS'), 'yyyy-MM') yearmo,
             classLabels
      from s3.cisexport.transactionView) fltb1
group by fltb1.sapId, yearmo;


