drill-user mailing list archives

From Boris Chmiel <boris.chm...@yahoo.com.INVALID>
Subject Re: Memory usage
Date Tue, 04 Aug 2015 16:05:48 GMT
Hi Andries,
I am using Drill 1.1.0. Configuration is:
DRILL_MAX_DIRECT_MEMORY="4G"
DRILL_HEAP="1G"
planner.memory.max_query_memory_per_node is 4147483648
Physical RAM is 8G. The computer is dedicated to testing Drill (fresh Windows install).
However, total.max peaks at 2,019,033,088 within Metrics.
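For reference, the option can be checked and set from sqlline, e.g. (a minimal sketch; the byte
value is the one above):

    -- current value of the per-node query memory limit
    SELECT * FROM sys.options WHERE name = 'planner.memory.max_query_memory_per_node';
    -- value in bytes, session scope
    ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 4147483648;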
During the initial query:
- 6 Minor Fragments for the PARQUET_WRITER are instantiated
- The query fails before writing starts
- 3 Minor Fragments for each of the 2 PARQUET_ROW_GROUP_SCAN operators: the first did not
start, and the second fails with peak memory at 125MB + 121MB + 21MB
Re-running the query after dropping planner.width.max_per_node from 3 to 1 (as shown below)
causes:
- only 1 minor fragment for each PARQUET_ROW_GROUP_SCAN operator
- both PARQUET_ROW_GROUP_SCAN operators to start (32K rows read out of 760K, and 760K out
of 4840K)
- the query to fail with PARQUET_WRITER and HASH_JOIN initiated (Major Fragment 1)
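The width change was applied at the session level, e.g. (a minimal sketch from sqlline; the
option name is the one above):

    -- limit each major fragment to one minor fragment per node
    ALTER SESSION SET `planner.width.max_per_node` = 1;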
The total peak memory usage within the plan is:
- 57MB for PARQUET_WRITER
- 109MB for HASH_JOIN
- 170MB for PARQUET_ROW_GROUP_SCAN #1
- 360MB for PARQUET_ROW_GROUP_SCAN #2
- 25MB for a PROJECT operator
=> 721MB peak
Do you think my configuration is not appropriate for what I'm trying to do? Am I simply
limited by physical memory?

Thanks

Regards,
Boris


On Tuesday, 4 August 2015 at 17:10, Andries Engelbrecht <aengelbrecht@maprtech.com> wrote:

 How much memory is allocated to Drill in the drill-env.sh file?

CTAS with Parquet can consume quite a bit of memory, as various structures are allocated in
memory before the Parquet files are written. If you look at the query profiles you will get
a good indication of the memory usage.

Also see how many fragments are working on creating the Parquet files; if you are limited
on memory, you can reduce the number of fragments in CTAS to limit memory usage.
You can check planner.width.max_per_node and reduce the number if it is higher than 1 (see
the sketch below).
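For example, from sqlline (a sketch; ALTER SESSION works the same way if you only want to
change the current connection):

    -- check the current setting
    SELECT * FROM sys.options WHERE name = 'planner.width.max_per_node';
    -- reduce parallelism cluster-wide
    ALTER SYSTEM SET `planner.width.max_per_node` = 1;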

Which version of Drill are you using?

—Andries


> On Aug 4, 2015, at 7:50 AM, Boris Chmiel <boris.chmiel@yahoo.com.INVALID> wrote:
> 
> Hi all,
> 
> I am trying to figure out how to optimize my queries. I found that when I prepare my data
> before querying it, using CTAS to apply a schema and transform my CSV files to Parquet
> format, subsequent queries are much more likely to reach OOM.
> 
> i.e.:
> 
> This direct query on CSV files works:
> 
> CREATE TABLE t3parquet AS (
>   SELECT * FROM Table1.csv
>   INNER JOIN Table2.csv ON table1.columns[0] = table2.columns[0]);
> 
> When this combination does not:
> 
> CREATE TABLE t1parquet AS (
>   SELECT
>     CAST(columns[0] AS varchar(10)) key1,
>     CAST(columns[1] … and so on)
>   FROM Table1.csv);
> 
> 
> CREATE TABLE t2parquet AS (
>   SELECT
>     CAST(columns[0] AS varchar(10)) key1,
>     CAST(columns[1] … and so on)
>   FROM Table2.csv);
> 
> 
> CREATE TABLE t3parquet AS (
>   SELECT * FROM t2parquet
>   INNER JOIN t1parquet ON t1parquet.key1 = t2parquet.key1);
> 
> 
> This last query runs out of memory (OOM) in PARQUET_ROW_GROUP_SCAN.
> 
> 
> I use embedded mode on Windows, file system storage, a 64MB Parquet block size, and not
> especially big files (less than a few hundred MB in raw format).
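> For the record, the 64MB block size was set per session, e.g. (a sketch, assuming the
> store.parquet.block-size option; 67108864 bytes = 64MB):
> 
>   -- assumed option name for the block size mentioned above; value in bytes
>   ALTER SESSION SET `store.parquet.block-size` = 67108864;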
> 
> 
> Does the way Drill / Parquet works imply that queries / views on raw files should be
> preferred over Parquet to save memory? Is this behavior normal?
> 
> Do you think my memory configuration should be tuned, or am I misunderstanding something?
> 
> 
>  
> Thanks in advance, and sorry for my English
> 
> Regards
> 
> Boris
> 
> 


  