kudu-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dan Burkert <...@cloudera.com>
Subject Re: Multi-level partitions question
Date Thu, 11 Oct 2018 22:25:24 GMT
> Just to clarify, are you saying that partition by hash(shop_id),
hash(customer_id) and partition by hash(customer_id), hash(shop_id) are
totally equivalent in terms of having tablets being skewed and there is no
difference at all?

Yes, that's correct, there is no difference at all in any metric, including
skew, read, or write performance.

> I thought it would be more like with Hive/Impala HDFS, then nested
partitions are grouped under different parent partitions. Does Kudu store
them as a bunch of independent files instead and each file will have data
for the specific hash of shop_id/customer_id?

That's correct, there is no parent/child relationship.  The only minor
clarification is that the individual hash pairs are a tablet, not a file.

- Dan

On Thu, Oct 11, 2018 at 1:31 PM Boris Tyukin <boris@boristyukin.com> wrote:

> Hi Dan,
>
> thanks for replying so quickly. Unfortunately, customer_id per shop is
> indeed very skewed (some shops have way more customers than others) and I
> was tempted to do  hash(shop_id, customer_id) but because 40% of queries
> won't use customer_id, it would cause a full scan. Actually to be accurate,
> then we join tables in Impala, first join might be limited on shop_id,
> while other tables will be joined by shop_id and customer_id. Of course,
> that first table will be very large (billions of rows).
>
> Just to clarify, are you saying that partition by hash(shop_id),
> hash(customer_id) and partition by hash(customer_id), hash(shop_id) are
> totally equivalent in terms of having tablets being skewed and there is no
> difference at all?
>
> I thought it would be more like with Hive/Impala HDFS, then nested
> partitions are grouped under different parent partitions. Does Kudu store
> them as a bunch of independent files instead and each file will have data
> for the specific hash of shop_id/customer_id?
>
> Boris
>
> On Thu, Oct 11, 2018 at 4:05 PM Dan Burkert <danburkert@apache.org> wrote:
>
>> Hi Boris,
>>
>> The two examples you gave are exactly equivalent; the relative ordering
>> of hash levels has no effect on query performance, hotspotting, or anything
>> else.  Given that 60% of your queries don't specify a specific customer_id,
>> it does make sense to use hash(shop_id), hash(customer_id) instead of
>> combining them in a single hash level as hash(shop_id, customer_id),
>> however the trade-off is that the hotspotting resistance isn't as good.  If
>> the shop_id and customer_id columns aren't skewed to begin with that's not
>> a concern, though.
>>
>> - Dan
>>
>> On Thu, Oct 11, 2018 at 12:14 PM Boris Tyukin <boris@boristyukin.com>
>> wrote:
>>
>>> Hi guys,
>>> Read this doc
>>> https://kudu.apache.org/docs/schema_design.html#multilevel-partitioning
>>> and I have a question on this particular statement
>>> "Scans on multilevel partitioned tables can take advantage of partition
>>> pruning on any of the levels independently"
>>>
>>> Does it mean, that both strategies below would be equivalent in terms of
>>> performance (i.e. minimum scans)
>>>
>>> partition by hash(shop_id), hash(customer_id)
>>> vs.
>>> partition by hash(customer_id), hash(shop_id)
>>>
>>> 60% of the queries are using both shop_id and customer_id but 40% of
>>> queries need to pull all customers for a specific shop_id. And almost never
>>> by customer_id alone (customer_id is not unique across shops and is
>>> assigned per shop).
>>>
>>> At the same time, if I partition by customer_id first,  partitions will
>>> be distributed more evenly.
>>>
>>> Thanks!
>>> Boris
>>>
>>>
>>>
>>>

Mime
View raw message