drill-user mailing list archives

From Alexander Reshetov <alexander.v.reshe...@gmail.com>
Subject Performance degradation for UNION ALL parquet data sources
Date Thu, 08 Dec 2016 17:08:17 GMT
Hi,

I have one data file which I converted to parquet with CTAS.

The following query takes about 35 seconds to execute:

select action['login'], count(*) from dfs.datastore.events group by
action['login'];

After splitting the original source into 4 equal parts, I created 4 views on
these parts (events_0, events_1, events_2, events_3) and combined them with
UNION ALL in a single view:

create view dfs.datastore.events_combined as
select t0.`timestamp` as event_time, t0.client_id, t0.action from
dfs.datastore.events_0 t0
union all
select t1.`timestamp` as event_time, t1.client_id, t1.action from
dfs.datastore.events_1 t1
union all
select t2.`timestamp` as event_time, t2.client_id, t2.action from
dfs.datastore.events_2 t2
union all
select t3.`timestamp` as event_time, t3.client_id, t3.action from
dfs.datastore.events_3 t3;


When I run the same query against this view, it executes much more slowly,
taking about 500 seconds:

select action['login'], count(*) from dfs.datastore.events_combined
group by action['login'];


I expected to see roughly the same execution time, but it is more than ten
times slower. What could cause this, and can it be fixed somehow?
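
In case it helps with diagnosis, the plans for the two queries can be
compared using Drill's EXPLAIN PLAN FOR statement. This is only a sketch
that reuses the queries above; no plan output is shown here, and where the
plans differ (if at all) is not something this message establishes.

explain plan for
select action['login'], count(*) from dfs.datastore.events group by
action['login'];

explain plan for
select action['login'], count(*) from dfs.datastore.events_combined
group by action['login'];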
