kudu-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From tenny susanto <tennysusa...@gmail.com>
Subject Re: kudu table design question
Date Fri, 10 Mar 2017 23:05:49 GMT
Ahh..no, I did not configure any compression on the kudu table. I will go
check the docs and enable compression.

In the meantime, here's the query profiles against impala parquet table vs
my existing no compression kudu table.


On Fri, Mar 10, 2017 at 11:50 AM, Todd Lipcon <todd@cloudera.com> wrote:

> Hi Tenny,
>
> Sorry for the delay on this thread. Just wanted to check in and find out
> how the experiments are going.
>
> Do you have query profiles of the query against Parquet and the same query
> against Kudu? The 15x difference you reported is not expected.
>
> I didn't see it mentioned above in the thread: did you configure any
> encoding or compression on the Kudu tables? This is often a source of major
> performance differences.
>
> -Todd
>
> On Fri, Feb 24, 2017 at 2:11 PM, tenny susanto <tennysusanto@gmail.com>
> wrote:
>
>> On my impala parquet table, each day partition is about 500MB - 1GB.
>>
>>
>> So using range partition by day, query time went down to 35 sec from 123
>> sec
>>
>>
>> Query against the impala table is 2 seconds.
>>
>>
>>
>>
>> On Fri, Feb 24, 2017 at 1:34 PM, Dan Burkert <dan@cloudera.com> wrote:
>>
>>> Hi Tenny,
>>>
>>> 1000 partitions is on the upper end of what I'd recommend - with 3x
>>> replication that's 125 tablet replicas per tablet server (something more
>>> like 20 or 30 would be ideal depending on hardware).  How much data does
>>> each day have?  I would aim for tablet size on the order of 50GiB, so if
>>> it's not that much per day you could try making week or month wide
>>> partitions.  Just bumping the number of partitions and being able to take
>>> advantage of partition pruning should improve the performance tremendously.
>>>
>>> In the next release we're adding support for pushdown IN list
>>> predicates, which could help your query even more if you could put
>>> company_id as the first component of your primary key.  That being said, I
>>> think improved range partition will likely give the most dramatic
>>> improvements, and there's no need to wait.
>>>
>>> Week wide range partitions can be specified like:
>>>
>>>  PARTITION 20170101 <= VALUES < 20170108,
>>>  PARTITION 20170108 <= VALUES < 20170115,
>>>  ...
>>>
>>>
>>> - Dan
>>>
>>> On Fri, Feb 24, 2017 at 1:15 PM, tenny susanto <tennysusanto@gmail.com>
>>> wrote:
>>>
>>>> I have 24 tablet servers.
>>>>
>>>> I added an id column because I needed a unique column to be the primary
>>>> key as kudu required primary key to be specified.  My original table
>>>> actually has 20 columns with no single primary key column. I concatenated
5
>>>> of them to build a unique id column which I made it as part of the primary
>>>> key. I have tried specifying 5 columns to be the primary key but I noticed
>>>> the inserts were much slower, so I tried with just 2 columns as primary key
>>>> instead, seems to improve insert speed.
>>>>
>>>> So this is my new schema and will measure query speed with it. If I
>>>> partition by day, is 1000 partitions too many? What is the recommended
>>>> maximum limit in the number of partitions kudu can handle?
>>>>
>>>> CREATE TABLE kudu_fact_table  (
>>>> print_date_id,
>>>> id STRING,
>>>> company_id INT,
>>>> transcount INT)
>>>> PRIMARY KEY(print_date_id,id)
>>>> ) PARTITION BY RANGE (print_date_id)
>>>> (
>>>>   PARTITION VALUE = 20170101,
>>>>   PARTITION VALUE = 20170102 ... (1 partition for each day, and I have
>>>> 3 year's worth of data)
>>>>  )
>>>> STORED AS KUDU
>>>> TBLPROPERTIES(
>>>>   'kudu.table_name' = 'kudu_fact_table',
>>>>   'kudu.master_addresses' = 'myserver:7051'
>>>> );
>>>>
>>>>
>>>>
>>>> On Thu, Feb 23, 2017 at 6:29 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>>>
>>>>> I'd add that moving the print_date_id to the beginning of the primary
>>>>> key in the Kudu fact table would allow each server to do a range scan
>>>>> instead of a full scan.
>>>>>
>>>>> -Todd
>>>>>
>>>>> On Thu, Feb 23, 2017 at 5:40 PM, Dan Burkert <dan@cloudera.com>
wrote:
>>>>>
>>>>>> Hi Tenny,
>>>>>>
>>>>>> First off, how many tablet servers are in your cluster?  16
>>>>>> partitions is appropriate for one or maybe two tablet servers, so
if your
>>>>>> cluster is bigger you could try bumping the number of partitions.
>>>>>>
>>>>>> Second, the schemas don't look identical, you have an additional
'id'
>>>>>> column in the Kudu table, and crucially, it doesn't have any predicates,
so
>>>>>> this query is doing a full table scan.
>>>>>>
>>>>>> Finally, the Parquet table is likely able to take advantage of
>>>>>> significant partition pruning due to the between clause.  An equivalent
in
>>>>>> Kudu would be range partitioning on the print_date_id.  You might
try doing
>>>>>> the same for Kudu.
>>>>>>
>>>>>> - Dan
>>>>>>
>>>>>> On Thu, Feb 23, 2017 at 5:08 PM, tenny susanto <
>>>>>> tennysusanto@gmail.com> wrote:
>>>>>>
>>>>>>> I have a table (call this fact_table)  that I want to create
in
>>>>>>> kudu.
>>>>>>>
>>>>>>> I have an equivalent table in impala/parquet that is partitioned
by
>>>>>>> day_id.
>>>>>>>
>>>>>>> create table impala_fact_table (
>>>>>>> company_id INT,
>>>>>>> transcount INT)
>>>>>>> partitioned by
>>>>>>> (print_date_id INT)
>>>>>>> STORED AS PARQUET;
>>>>>>>
>>>>>>> so a common query would be:
>>>>>>>
>>>>>>> select  sum(transcount)
>>>>>>> from impala_fact_table f
>>>>>>> join with company_dim c on f.company_id = c.company_id
>>>>>>> where c.company_id in (123,456)
>>>>>>> and f.print_date_id between 20170101 and 20170202
>>>>>>>
>>>>>>> I created an equivalent of the fact table in kudu:
>>>>>>>
>>>>>>> CREATE TABLE kudu_fact_table  (
>>>>>>> id STRING,
>>>>>>> print_date_id,
>>>>>>> company_id INT,
>>>>>>> transcount INT)
>>>>>>> PRIMARY KEY(id,print_date_id)
>>>>>>> ) PARTITION BY HASH PARTITIONS 16
>>>>>>> )
>>>>>>> STORED AS KUDU
>>>>>>> TBLPROPERTIES(
>>>>>>>   'kudu.table_name' = 'kudu_fact_table',
>>>>>>>   'kudu.master_addresses' = 'myserver:7051'
>>>>>>> );
>>>>>>>
>>>>>>> But the performance of the join with this kudu table is terrible,
2
>>>>>>> secs with impala table vs 126 secs with kudu table.
>>>>>>>
>>>>>>> select  sum(transcount)
>>>>>>> from kudu_fact_table f
>>>>>>> join with company_dim c on f.company_id = c.company_id
>>>>>>> where c.company_id in (123,456)
>>>>>>> and f.print_date_id between 20170101 and 20170202
>>>>>>>
>>>>>>> How should I design my kudu table so performance is somewhat
>>>>>>> comparable?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Todd Lipcon
>>>>> Software Engineer, Cloudera
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>>
>>>> Tenny Susanto
>>>>
>>>>
>>>
>>
>>
>> --
>> Regards,
>>
>> Tenny Susanto
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Regards,

Tenny Susanto

Mime
View raw message