kudu-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From tenny susanto <tennysusa...@gmail.com>
Subject Re: kudu table design question
Date Fri, 24 Feb 2017 22:11:39 GMT
On my impala parquet table, each day partition is about 500MB - 1GB.


So using range partition by day, query time went down to 35 sec from 123 sec


Query against the impala table is 2 seconds.




On Fri, Feb 24, 2017 at 1:34 PM, Dan Burkert <dan@cloudera.com> wrote:

> Hi Tenny,
>
> 1000 partitions is on the upper end of what I'd recommend - with 3x
> replication that's 125 tablet replicas per tablet server (something more
> like 20 or 30 would be ideal depending on hardware).  How much data does
> each day have?  I would aim for tablet size on the order of 50GiB, so if
> it's not that much per day you could try making week or month wide
> partitions.  Just bumping the number of partitions and being able to take
> advantage of partition pruning should improve the performance tremendously.
>
> In the next release we're adding support for pushdown IN list predicates,
> which could help your query even more if you could put company_id as the
> first component of your primary key.  That being said, I think improved
> range partition will likely give the most dramatic improvements, and
> there's no need to wait.
>
> Week wide range partitions can be specified like:
>
>  PARTITION 20170101 <= VALUES < 20170108,
>  PARTITION 20170108 <= VALUES < 20170115,
>  ...
>
>
> - Dan
>
> On Fri, Feb 24, 2017 at 1:15 PM, tenny susanto <tennysusanto@gmail.com>
> wrote:
>
>> I have 24 tablet servers.
>>
>> I added an id column because I needed a unique column to be the primary
>> key as kudu required primary key to be specified.  My original table
>> actually has 20 columns with no single primary key column. I concatenated 5
>> of them to build a unique id column which I made it as part of the primary
>> key. I have tried specifying 5 columns to be the primary key but I noticed
>> the inserts were much slower, so I tried with just 2 columns as primary key
>> instead, seems to improve insert speed.
>>
>> So this is my new schema and will measure query speed with it. If I
>> partition by day, is 1000 partitions too many? What is the recommended
>> maximum limit in the number of partitions kudu can handle?
>>
>> CREATE TABLE kudu_fact_table  (
>> print_date_id,
>> id STRING,
>> company_id INT,
>> transcount INT)
>> PRIMARY KEY(print_date_id,id)
>> ) PARTITION BY RANGE (print_date_id)
>> (
>>   PARTITION VALUE = 20170101,
>>   PARTITION VALUE = 20170102 ... (1 partition for each day, and I have 3
>> year's worth of data)
>>  )
>> STORED AS KUDU
>> TBLPROPERTIES(
>>   'kudu.table_name' = 'kudu_fact_table',
>>   'kudu.master_addresses' = 'myserver:7051'
>> );
>>
>>
>>
>> On Thu, Feb 23, 2017 at 6:29 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>
>>> I'd add that moving the print_date_id to the beginning of the primary
>>> key in the Kudu fact table would allow each server to do a range scan
>>> instead of a full scan.
>>>
>>> -Todd
>>>
>>> On Thu, Feb 23, 2017 at 5:40 PM, Dan Burkert <dan@cloudera.com> wrote:
>>>
>>>> Hi Tenny,
>>>>
>>>> First off, how many tablet servers are in your cluster?  16 partitions
>>>> is appropriate for one or maybe two tablet servers, so if your cluster is
>>>> bigger you could try bumping the number of partitions.
>>>>
>>>> Second, the schemas don't look identical, you have an additional 'id'
>>>> column in the Kudu table, and crucially, it doesn't have any predicates,
so
>>>> this query is doing a full table scan.
>>>>
>>>> Finally, the Parquet table is likely able to take advantage of
>>>> significant partition pruning due to the between clause.  An equivalent in
>>>> Kudu would be range partitioning on the print_date_id.  You might try doing
>>>> the same for Kudu.
>>>>
>>>> - Dan
>>>>
>>>> On Thu, Feb 23, 2017 at 5:08 PM, tenny susanto <tennysusanto@gmail.com>
>>>> wrote:
>>>>
>>>>> I have a table (call this fact_table)  that I want to create in kudu.
>>>>>
>>>>> I have an equivalent table in impala/parquet that is partitioned by
>>>>> day_id.
>>>>>
>>>>> create table impala_fact_table (
>>>>> company_id INT,
>>>>> transcount INT)
>>>>> partitioned by
>>>>> (print_date_id INT)
>>>>> STORED AS PARQUET;
>>>>>
>>>>> so a common query would be:
>>>>>
>>>>> select  sum(transcount)
>>>>> from impala_fact_table f
>>>>> join with company_dim c on f.company_id = c.company_id
>>>>> where c.company_id in (123,456)
>>>>> and f.print_date_id between 20170101 and 20170202
>>>>>
>>>>> I created an equivalent of the fact table in kudu:
>>>>>
>>>>> CREATE TABLE kudu_fact_table  (
>>>>> id STRING,
>>>>> print_date_id,
>>>>> company_id INT,
>>>>> transcount INT)
>>>>> PRIMARY KEY(id,print_date_id)
>>>>> ) PARTITION BY HASH PARTITIONS 16
>>>>> )
>>>>> STORED AS KUDU
>>>>> TBLPROPERTIES(
>>>>>   'kudu.table_name' = 'kudu_fact_table',
>>>>>   'kudu.master_addresses' = 'myserver:7051'
>>>>> );
>>>>>
>>>>> But the performance of the join with this kudu table is terrible, 2
>>>>> secs with impala table vs 126 secs with kudu table.
>>>>>
>>>>> select  sum(transcount)
>>>>> from kudu_fact_table f
>>>>> join with company_dim c on f.company_id = c.company_id
>>>>> where c.company_id in (123,456)
>>>>> and f.print_date_id between 20170101 and 20170202
>>>>>
>>>>> How should I design my kudu table so performance is somewhat
>>>>> comparable?
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>>
>> --
>> Regards,
>>
>> Tenny Susanto
>>
>>
>


-- 
Regards,

Tenny Susanto

Mime
View raw message