drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Francis McGregor-Macdonald <fran...@mc-mac.com>
Subject Query Optimization and Constraints
Date Wed, 03 Jan 2018 22:07:24 GMT
Hi,

I am running a query:

CREATE TABLE `s3://temp.parquet`
PARTITION BY (`fact`)
AS
WITH facts AS (
    SELECT col1, col2, col3
    FROM TABLE(`s3://...facts.csv`
     (type => 'text', fieldDelimiter => '|', extractHeader => true))
    WHERE ...
)
,dimension1 AS (
    SELECT col1, col2, col3
    FROM TABLE(`s3://...demension1.csv`
     (type => 'text', fieldDelimiter => '|', extractHeader => true))
)
,dimension2 AS (
    SELECT col1, col2, col3
    FROM TABLE(`s3://...demension2.csv`
     (type => 'text', fieldDelimiter => '|', extractHeader => true))
)
SELECT
    `col1, co2, etc
FROM facts
INNER JOIN ...
INNER JOIN ...

Drill is 1.12.0 running on AWS EMR (3 x r4.2xlarge, I would add more nodes
but as per below is only running a single fragment). Other queries do run
correctly with multiple fragments.

I have a couple of questions please:
1. The only way I can get the query to use all available memory (as defined
by planner.memory.max_query_memory_per_node = n) is to
set planner.memory.min_memory_per_buffered_op = n (i.e. the same as
planner.memory.max_query_memory_per_node. I can only see limited
documentation on the set planner.memory.min_memory_per_buffered_op, is this
expected behaviour?

2. The query irrespective of the file sizes always runs as a single thread.
Given the files range from ~100MB to ~100GB this means it can take
significant time (hours). Is this due to a limitation of querying data from
s3 or can this be optimized?

I have been setting the following based on documents I have reviewed on
optimizing Drill:
a. exec.min_hash_table_size
b. planner.width.max_per_node
c. planner.memory.max_query_memory_per_node
d. planner.memory.min_memory_per_buffered_op
e. planner.memory_limit

Is there any other likely "constraint" I should be looking at?

Is there a"query constraints check" flow process I could be working through
to optimize my query (I haven't found one)? If there isn't a generic one I
will happy write on up based on my working.

Regards,
Francis

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message