kudu-user mailing list archives

From tenny susanto <tennysusa...@gmail.com>
Subject kudu table design question
Date Fri, 24 Feb 2017 01:08:24 GMT
I have a table (call it fact_table) that I want to create in Kudu.

I have an equivalent table in Impala/Parquet that is partitioned by day (print_date_id).

create table impala_fact_table (
company_id INT,
transcount INT)
partitioned by
(print_date_id INT)
STORED AS PARQUET;

A common query would be:

select sum(transcount)
from impala_fact_table f
join company_dim c on f.company_id = c.company_id
where c.company_id in (123,456)
and f.print_date_id between 20170101 and 20170202;

I created an equivalent of the fact table in Kudu:

CREATE TABLE kudu_fact_table (
id STRING,
print_date_id INT,
company_id INT,
transcount INT,
PRIMARY KEY (id, print_date_id)
)
PARTITION BY HASH PARTITIONS 16
STORED AS KUDU
TBLPROPERTIES(
  'kudu.table_name' = 'kudu_fact_table',
  'kudu.master_addresses' = 'myserver:7051'
);

But the join against this Kudu table performs terribly: 2 seconds with the
Impala/Parquet table versus 126 seconds with the Kudu table.

select sum(transcount)
from kudu_fact_table f
join company_dim c on f.company_id = c.company_id
where c.company_id in (123,456)
and f.print_date_id between 20170101 and 20170202;

How should I design my Kudu table so that performance is roughly comparable?
