kudu-user mailing list archives

From davidral...@gmail.com
Subject Re: kudu table design question
Date Fri, 10 Mar 2017 23:25:32 GMT
Hi Tenny

  I see in the plan that the Kudu tables are missing statistics, so a first step would be to compute statistics and check the result.
  Also, from the plan I see this particular query is only scanning 3 Kudu hosts, vs. 24 for Parquet. If queries tend to scan only a small range, consider adding hash partitioning combined with coarser-grained range partitions.
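
In DDL terms, the two suggestions above might look roughly like this (column and server names are taken from the schemas later in the thread; the bucket count and month-wide ranges are illustrative assumptions, not tuned recommendations):

```sql
-- Compute statistics so Impala's planner has cardinality estimates
-- for the Kudu table.
COMPUTE STATS kudu_fact_table;

-- Hypothetical sketch: hash partitioning on id spreads each day's data
-- across many tablet servers, while coarser (month-wide) range partitions
-- still allow pruning on the print_date_id predicate.
CREATE TABLE kudu_fact_table_v2 (
  print_date_id INT,
  id STRING,
  company_id INT,
  transcount INT,
  PRIMARY KEY (print_date_id, id)
)
PARTITION BY HASH (id) PARTITIONS 8,
             RANGE (print_date_id) (
  PARTITION 20170101 <= VALUES < 20170201,
  PARTITION 20170201 <= VALUES < 20170301
  -- ...one partition per month
)
STORED AS KUDU
TBLPROPERTIES (
  'kudu.table_name' = 'kudu_fact_table_v2',
  'kudu.master_addresses' = 'myserver:7051'
);
```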

Best
David

Sent from my iPhone

> On Mar 10, 2017, at 3:05 PM, tenny susanto <tennysusanto@gmail.com> wrote:
> 
> Ahh, no, I did not configure any compression on the Kudu table. I will go check the docs and enable compression.
> 
> In the meantime, here are the query profiles against the Impala Parquet table vs. my existing no-compression Kudu table.
> 
> 
>> On Fri, Mar 10, 2017 at 11:50 AM, Todd Lipcon <todd@cloudera.com> wrote:
>> Hi Tenny,
>> 
>> Sorry for the delay on this thread. Just wanted to check in and find out how the experiments are going.
>> 
>> Do you have query profiles of the query against Parquet and the same query against Kudu? The 15x difference you reported is not expected.
>> 
>> I didn't see it mentioned above in the thread: did you configure any encoding or compression on the Kudu tables? This is often a source of major performance differences.
>> 
>> -Todd
>> 
>>> On Fri, Feb 24, 2017 at 2:11 PM, tenny susanto <tennysusanto@gmail.com> wrote:
>>> On my impala parquet table, each day partition is about 500MB - 1GB.
>>> 
>>> 
>>> So using range partitioning by day, the query time went down from 123 sec to 35 sec.
>>> 
>>> 
>>> Query against the impala table is 2 seconds.
>>> 
>>> 
>>> 
>>> 
>>>> On Fri, Feb 24, 2017 at 1:34 PM, Dan Burkert <dan@cloudera.com> wrote:
>>>> Hi Tenny,
>>>> 
>>>> 1000 partitions is on the upper end of what I'd recommend - with 3x replication that's 125 tablet replicas per tablet server (something more like 20 or 30 would be ideal depending on hardware).  How much data does each day have?  I would aim for a tablet size on the order of 50GiB, so if it's not that much per day you could try making week- or month-wide partitions.  Just bumping the number of partitions and being able to take advantage of partition pruning should improve the performance tremendously.
>>>> 
>>>> In the next release we're adding support for pushing down IN-list predicates, which could help your query even more if you put company_id as the first component of your primary key.  That being said, I think improved range partitioning will likely give the most dramatic improvement, and there's no need to wait.
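
For illustration, a company_id-first key might be declared like this (a sketch only; the table name is hypothetical, and the benefit depends on the IN-list pushdown described above actually landing):

```sql
-- Hypothetical: company_id leads the primary key so that, once IN-list
-- predicates can be pushed down to Kudu, a query for a handful of
-- companies becomes a few short range scans instead of a full scan.
-- print_date_id stays in the key (and drives range partitioning) so
-- date-range pruning still applies.
CREATE TABLE kudu_fact_table_by_company (
  company_id INT,
  print_date_id INT,
  id STRING,
  transcount INT,
  PRIMARY KEY (company_id, print_date_id, id)
)
PARTITION BY RANGE (print_date_id) (
  PARTITION 20170101 <= VALUES < 20170108
  -- ...additional week-wide partitions
)
STORED AS KUDU
TBLPROPERTIES (
  'kudu.table_name' = 'kudu_fact_table_by_company',
  'kudu.master_addresses' = 'myserver:7051'
);
```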
>>>> 
>>>> Week wide range partitions can be specified like:
>>>> 
>>>>  PARTITION 20170101 <= VALUES < 20170108,
>>>>  PARTITION 20170108 <= VALUES < 20170115,
>>>>  ...
>>>> 
>>>> 
>>>> - Dan
>>>> 
>>>>> On Fri, Feb 24, 2017 at 1:15 PM, tenny susanto <tennysusanto@gmail.com> wrote:
>>>>> I have 24 tablet servers.
>>>>> 
>>>>> I added an id column because I needed a unique column to be the primary key, since Kudu requires a primary key to be specified.  My original table actually has 20 columns with no single primary-key column. I concatenated 5 of them to build a unique id column, which I made part of the primary key. I tried specifying 5 columns as the primary key, but I noticed the inserts were much slower, so I tried with just 2 columns as the primary key instead, which seemed to improve insert speed.
>>>>> 
>>>>> So this is my new schema, and I will measure query speed with it. If I partition by day, is 1000 partitions too many? What is the recommended maximum number of partitions Kudu can handle?
>>>>> 
>>>>> CREATE TABLE kudu_fact_table (
>>>>>   print_date_id INT,
>>>>>   id STRING,
>>>>>   company_id INT,
>>>>>   transcount INT,
>>>>>   PRIMARY KEY (print_date_id, id)
>>>>> ) PARTITION BY RANGE (print_date_id)
>>>>> (
>>>>>   PARTITION VALUE = 20170101,
>>>>>   PARTITION VALUE = 20170102 ... (1 partition for each day, and I have 3 years' worth of data)
>>>>> )
>>>>> STORED AS KUDU
>>>>> TBLPROPERTIES(
>>>>>   'kudu.table_name' = 'kudu_fact_table',
>>>>>   'kudu.master_addresses' = 'myserver:7051'
>>>>> );
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Thu, Feb 23, 2017 at 6:29 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>>>>> I'd add that moving print_date_id to the beginning of the primary key in the Kudu fact table would allow each server to do a range scan instead of a full scan.
>>>>>> 
>>>>>> -Todd
>>>>>> 
>>>>>>> On Thu, Feb 23, 2017 at 5:40 PM, Dan Burkert <dan@cloudera.com> wrote:
>>>>>>> Hi Tenny,
>>>>>>> 
>>>>>>> First off, how many tablet servers are in your cluster?  16 partitions is appropriate for one or maybe two tablet servers, so if your cluster is bigger you could try bumping the number of partitions.
>>>>>>> 
>>>>>>> Second, the schemas don't look identical: you have an additional 'id' column in the Kudu table, and crucially, it doesn't have any predicates, so this query is doing a full table scan.
>>>>>>> 
>>>>>>> Finally, the Parquet table is likely able to take advantage of significant partition pruning due to the BETWEEN clause.  An equivalent in Kudu would be range partitioning on print_date_id.  You might try doing the same for Kudu.
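
A minimal sketch of that suggestion, using the column names from the schema below (the per-day ranges mirror what Tenny later tried; the table name is hypothetical):

```sql
-- Hypothetical sketch: range partitioning on print_date_id lets the
-- "print_date_id BETWEEN 20170101 AND 20170202" predicate prune every
-- partition outside that date range, mirroring the Parquet table's
-- day_id partition pruning.
CREATE TABLE kudu_fact_table_ranged (
  print_date_id INT,
  id STRING,
  company_id INT,
  transcount INT,
  PRIMARY KEY (print_date_id, id)
)
PARTITION BY RANGE (print_date_id) (
  PARTITION VALUE = 20170101,
  PARTITION VALUE = 20170102
  -- ...one partition per day
)
STORED AS KUDU
TBLPROPERTIES (
  'kudu.table_name' = 'kudu_fact_table_ranged',
  'kudu.master_addresses' = 'myserver:7051'
);
```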
>>>>>>> 
>>>>>>> - Dan
>>>>>>> 
>>>>>>>> On Thu, Feb 23, 2017 at 5:08 PM, tenny susanto <tennysusanto@gmail.com> wrote:
>>>>>>>> I have a table (call this fact_table) that I want to create in Kudu.
>>>>>>>> 
>>>>>>>> I have an equivalent table in Impala/Parquet that is partitioned by day_id.
>>>>>>>> 
>>>>>>>> create table impala_fact_table (
>>>>>>>>   company_id INT,
>>>>>>>>   transcount INT)
>>>>>>>> partitioned by (print_date_id INT)
>>>>>>>> STORED AS PARQUET;
>>>>>>>> 
>>>>>>>> so a common query would be:
>>>>>>>> 
>>>>>>>> select  sum(transcount)
>>>>>>>> from impala_fact_table f
>>>>>>>> join company_dim c on f.company_id = c.company_id
>>>>>>>> where c.company_id in (123,456)
>>>>>>>> and f.print_date_id between 20170101 and 20170202
>>>>>>>> 
>>>>>>>> I created an equivalent of the fact table in kudu:
>>>>>>>> 
>>>>>>>> CREATE TABLE kudu_fact_table (
>>>>>>>>   id STRING,
>>>>>>>>   print_date_id INT,
>>>>>>>>   company_id INT,
>>>>>>>>   transcount INT,
>>>>>>>>   PRIMARY KEY (id, print_date_id)
>>>>>>>> ) PARTITION BY HASH PARTITIONS 16
>>>>>>>> STORED AS KUDU
>>>>>>>> TBLPROPERTIES(
>>>>>>>>   'kudu.table_name' = 'kudu_fact_table',
>>>>>>>>   'kudu.master_addresses' = 'myserver:7051'
>>>>>>>> );
>>>>>>>> 
>>>>>>>> But the performance of the join with this Kudu table is terrible: 2 secs with the Impala table vs. 126 secs with the Kudu table.
>>>>>>>> 
>>>>>>>> select  sum(transcount)
>>>>>>>> from kudu_fact_table f
>>>>>>>> join company_dim c on f.company_id = c.company_id
>>>>>>>> where c.company_id in (123,456)
>>>>>>>> and f.print_date_id between 20170101 and 20170202
>>>>>>>> 
>>>>>>>> How should I design my Kudu table so performance is somewhat comparable?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> Todd Lipcon
>>>>>> Software Engineer, Cloudera
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Regards,
>>>>> 
>>>>> Tenny Susanto
>>> 
>>> 
>>> 
>>> -- 
>>> Regards,
>>> 
>>> Tenny Susanto
>> 
>> 
>> 
>> -- 
>> Todd Lipcon
>> Software Engineer, Cloudera
> 
> 
> 
> -- 
> Regards,
> 
> Tenny Susanto 
> 
> <benchmark_impala_parquet.txt>
> <benchmark_kudu_range_partition.txt>
