drill-user mailing list archives

From Andries Engelbrecht <aengelbre...@maprtech.com>
Subject Re: Memory usage
Date Tue, 04 Aug 2015 16:44:40 GMT
See if you can bump your heap to 2G.
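
For reference, in drill-env.sh that would be something like the lines below (I have not used Windows embedded, so where exactly you set these may differ there, e.g. the environment or sqlline.bat):

DRILL_MAX_DIRECT_MEMORY="4G"
DRILL_HEAP="2G"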

I have not worked on Windows embedded, but there is no reason why setting the node width to 1 should not work. I have used it on Linux and Mac with no issue to create Parquet files.
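
For example, to check the current value and drop it to 1 for your session:

SELECT * FROM sys.options WHERE name = 'planner.width.max_per_node';
ALTER SESSION SET `planner.width.max_per_node` = 1;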

Also decrease the parquet block size to 32MB on such a small system if you keep running into
memory issues. You can also turn off HASH JOIN to limit memory usage (planner.enable_hashjoin),
but it may slow down the query. HASH JOIN and Parquet writer can consume a lot of memory.
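
Something like this at the session level (store.parquet.block-size is in bytes, so 32MB is 33554432):

ALTER SESSION SET `store.parquet.block-size` = 33554432;
ALTER SESSION SET `planner.enable_hashjoin` = false;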

On the last query, which columns do you actually want? A select * on a join may not always be ideal with CTAS. Without knowing the data it is hard to tell what the desired outcome is.

—Andries


> On Aug 4, 2015, at 9:05 AM, Boris Chmiel <boris.chmiel@yahoo.com.INVALID> wrote:
> 
> Hi Andries,
> I am using Drill 1.1.0. Configuration is: DRILL_MAX_DIRECT_MEMORY="4G", DRILL_HEAP="1G", and planner.memory.max_query_memory_per_node is 4147483648. Physical RAM is 8G. The computer is dedicated to testing Drill (fresh Windows install).
> However, total.max peaks at 2 019 033 088 in Metrics.
> During the initial query:
> - 6 minor fragments in the PARQUET_WRITER are instantiated
> - the query fails before writing starts
> - 3 minor fragments for each of the 2 PARQUET_ROW_GROUP_SCANs: the first did not start, and the second fails with peak memory at 125MB + 121MB + 21MB
> Re-running the query after dropping planner.width.max_per_node from 3 to 1 causes:
> - only 1 minor fragment for each PARQUET_ROW_GROUP_SCAN operator
> - both PARQUET_ROW_GROUP_SCAN operators to start (32K rows read out of 760K, and 760K out of 4840K)
> - the query to fail with PARQUET_WRITER and HASH_JOIN initiated (Major Fragment 1)
> The total peak memory usage within the plan is:
> - 57MB for PARQUET_WRITER
> - 109MB for HASH_JOIN
> - 170MB for PARQUET_ROW_GROUP_SCAN #1
> - 360MB for PARQUET_ROW_GROUP_SCAN #2
> - 25MB for the PROJECT operators
> => 721MB peak
> Do you think my configuration is not appropriate for what I'm trying to do? Am I definitely limited by physical memory?
> Thanks
> Regards,
> Boris
> 
> 
>     On Tuesday, 4 August 2015 at 17:10, Andries Engelbrecht <aengelbrecht@maprtech.com> wrote:
> 
> 
> How much memory is allocated to Drill in the drill-env.sh file?
> 
> CTAS with parquet can consume quite a bit of memory as various structures are allocated in memory before the parquet files are written. If you look in the query profiles you will get a good indication of the memory usage.
> 
> Also see how many fragments are working on creating the parquet files; if you are limited on memory, you can reduce the number of fragments in CTAS to limit memory usage.
> You can check planner.width.max_per_node and reduce the number if it is higher than 1.
> 
> 
> Which version of Drill are you using?
> 
> —Andries
> 
> 
>> On Aug 4, 2015, at 7:50 AM, Boris Chmiel <boris.chmiel@yahoo.com.INVALID> wrote:
>> 
>> Hi all,
>> 
>>   I am trying to figure out how to optimize my queries. I found that when I prepare my data prior to querying it, using CTAS to apply a schema and transform my CSV files to Parquet format, subsequent queries are much more likely to reach OOM.
>> 
>> For example:
>> 
>>   This direct query on CSV files works:
>> 
>> CREATE TABLE t3parquet AS (
>> SELECT * FROM `Table1.csv` t1
>> INNER JOIN `Table2.csv` t2 ON t1.columns[0] = t2.columns[0]);
>> 
>>   While this combination does not:
>> 
>> CREATE TABLE t1parquet AS (
>> SELECT
>> CAST(columns[0] AS varchar(10)) key1,
>> CAST(columns[1] … and so on)
>> FROM `Table1.csv`);
>> 
>> 
>>   
>> CREATE TABLE t2parquet AS (
>> SELECT CAST(columns[0] AS varchar(10)) key1,
>> CAST(columns[1] … and so on)
>> FROM `Table2.csv`);
>> 
>> 
>>   
>> CREATE TABLE t3parquet AS (
>> SELECT * FROM t2parquet
>> INNER JOIN t1parquet ON t1parquet.key1 = t2parquet.key1);
>> 
>> 
>>   
>> This last query runs OOM on PARQUET_ROW_GROUP_SCAN
>> 
>> 
>>   
>> I use embedded mode on Windows, file system storage, a 64MB parquet block size, and not-so-big files (less than a few hundred MB in raw format).
>> 
>> 
>>   
>> 
>>   
>> Does the way Drill / Parquet works imply that queries/views on raw files should be preferred over Parquet to save memory? Is this behavior normal?
>> 
>> Do you think my memory configuration should be tuned, or am I misunderstanding something?
>> 
>> 
>>   
>> Thanks in advance, and sorry for my English
>> 
>> Regards
>> 
>> Boris
>> 
>> 
> 
> 

