drill-user mailing list archives

From Boris Chmiel <boris.chm...@yahoo.com.INVALID>
Subject Memory usage
Date Tue, 04 Aug 2015 14:50:09 GMT
Hi all,

I am trying to figure out how to optimize my queries. I found that when I prepare my data
before querying it, using CTAS to apply a schema and convert my CSV files to Parquet format,
subsequent queries are much more likely to run out of memory (OOM).

For example, this direct query on CSV files works:

CREATE TABLE t3parquet AS (
SELECT * FROM Table1.csv
INNER JOIN Table2.csv ON table1.columns[0] = table2.columns[0]);

whereas this combination does not:

CREATE TABLE t1parquet AS (
SELECT CAST(columns[0] AS varchar(10)) key1,
CAST(columns[1] …and so on)
FROM Table1.csv);

CREATE TABLE t2parquet AS (
SELECT CAST(columns[0] AS varchar(10)) key1,
CAST(columns[1] …and so on)
FROM Table2.csv);

CREATE TABLE t3parquet AS (
SELECT * FROM t2parquet
INNER JOIN t1parquet ON t1parquet.key1 = t2parquet.key1);

This last query runs out of memory (OOM) on PARQUET_ROW_GROUP_SCAN.

I use embedded mode on Windows with file system storage and a 64 MB Parquet block size. The
files are not very big (less than a few hundred MB in raw format).
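
For reference, the settings mentioned above can be inspected from the Drill shell. This is only a minimal sketch: it assumes the sys.options and sys.memory system tables are available in the Drill version in use, and the ALTER SESSION value is simply 64 MB expressed in bytes, not a recommended setting.

```sql
-- Show the Parquet block size used by CTAS (option name: store.parquet.block-size).
SELECT * FROM sys.options WHERE name = 'store.parquet.block-size';

-- Show current and maximum heap / direct memory for the Drillbit.
SELECT * FROM sys.memory;

-- Set the Parquet block size to 64 MB (64 * 1024 * 1024 bytes) for this session.
ALTER SESSION SET `store.parquet.block-size` = 67108864;
```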


Does the way Drill / Parquet works imply that queries or views on raw files should be preferred
over Parquet to save memory? Is this behavior normal?

Do you think my memory configuration should be tuned, or am I misunderstanding something?

Thanks in advance, and sorry for my English.


