kudu-user mailing list archives

From Sand Stone <sand.m.st...@gmail.com>
Subject Re: Partition and Split rows
Date Thu, 12 May 2016 18:39:04 GMT
Thanks, Dan.

In your scheme, I assume you suggest range partitioning on the timestamp.
I don't know how Kudu load balances the data across the tablet servers. For
example, do I need to pre-calculate, for each day, a list of timestamps 5
minutes apart at table creation? [Assume I have to create a new table every
day.]
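To illustrate (this sketch is not from the original thread, and assumes the table-per-day scheme above): pre-computing the per-day list of 5-minute-apart split timestamps is cheap, e.g. in Python:

```python
from datetime import datetime, timedelta

def five_minute_splits(day):
    """Return the 288 timestamps, 5 minutes apart, covering one day.

    These could serve as range-partition split rows when the table
    for that day is created."""
    start = day.replace(hour=0, minute=0, second=0, microsecond=0)
    return [start + timedelta(minutes=5 * i) for i in range(288)]

# 24 h * 60 min / 5 min = 288 split points for one day
splits = five_minute_splits(datetime(2016, 5, 12))
```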

My hope, with the additional 5-min column used as the range partition
column, is that I could spread the data evenly across the tablet servers.
Once partition-level deletion works, I won't need to re-create the table.
Also, since 5-min interval data are always colocated, read queries could be
efficient too.
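Not from the original thread: a minimal Python sketch of how such a computed 5-min column could be filled in at ingestion, assuming the stored value is the minute-of-hour rounded down to a 5-minute boundary (so it fits in an INT8 column):

```python
def five_min_bar(ts_seconds):
    """Minute-of-hour rounded down to a 5-minute boundary (0, 5, ..., 55).

    Small enough for an INT8 column; would be filled in as a computed
    column when each row is ingested."""
    minute = (ts_seconds // 60) % 60
    return minute - (minute % 5)
```

For example, a row with a timestamp at 13 minutes past the hour lands in the bar value 10.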

P.S.: In some cases I would like to compute aggregations across all
metrics at 5-min intervals.
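For illustration only (a hypothetical Python sketch, not part of the thread), an aggregation across all metrics at 5-min intervals could look like:

```python
from collections import defaultdict

def aggregate_by_bucket(rows, bucket_seconds=300):
    """Average 'value' per 5-minute time bucket, across all metrics.

    rows: iterable of (metric, ts_seconds, value) tuples."""
    totals = defaultdict(lambda: [0.0, 0])  # bucket -> [sum, count]
    for _metric, ts, value in rows:
        bucket = ts // bucket_seconds
        totals[bucket][0] += value
        totals[bucket][1] += 1
    return {b: s / n for b, (s, n) in totals.items()}
```

In practice this would be pushed down as a scan plus group-by rather than done client-side; the sketch only shows the bucketing arithmetic.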


On Thu, May 12, 2016 at 11:13 AM, Dan Burkert <dan@cloudera.com> wrote:

> Forgot to add the PK specification to the CREATE TABLE, it should have
> read as follows:
>
> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE)
> PRIMARY KEY (metric, time);
>
> - Dan
>
>
> On Thu, May 12, 2016 at 11:12 AM, Dan Burkert <dan@cloudera.com> wrote:
>
>>
>> On Thu, May 12, 2016 at 11:05 AM, Sand Stone <sand.m.stone@gmail.com>
>> wrote:
>>
>>> > Is the requirement to pre-aggregate by time window?
>>> No, I am thinking of creating a column, say "minute". It's basically the
>>> minute field of the timestamp column (even rounded to a 5-min bucket
>>> depending on the needs). So it's a computed column filled in on data
>>> ingestion. My goal is that this field would help with data filtering at
>>> read/query time, say selecting a certain projection at minutes 10-15, to
>>> speed up the read queries.
>>>
>>
>> In many cases, Kudu can do this for you without having to add special
>> columns.  The requirements are that the timestamp is part of the primary
>> key, and that any columns that come before the timestamp in the primary key
>> (if it's a compound PK) have equality predicates.  So for instance, if you
>> create a table such as:
>>
>> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE);
>>
>> then a query such as
>>
>> SELECT time, value FROM metrics WHERE metric = "my-metric" AND time >
>> 2016-05-01T00:00 AND time < 2016-05-01T00:05
>>
>> will read only the data for that 5-minute time window from disk.
>> If the query didn't have the equality predicate on the 'metric' column,
>> then it would do a much bigger scan + filter operation.  If you want more
>> background on how this is achieved, check out the partition pruning design
>> doc:
>> https://github.com/apache/incubator-kudu/blob/master/docs/design-docs/scan-optimization-partition-pruning.md
>> .
>>
>> - Dan
>>
>>
>>
>>> Thanks for the info., I will follow them.
>>>
>>> On Thu, May 12, 2016 at 10:50 AM, Dan Burkert <dan@cloudera.com> wrote:
>>>
>>>> Hey Sand,
>>>>
>>>> Sorry for the delayed response.  I'm not quite following your use
>>>> case.  Is the requirement to pre-aggregate by time window? I don't think
>>>> Kudu can help you directly with that (nothing built in), but you could
>>>> always create a separate table to store the pre-aggregated values.  As far
>>>> as applying functions to do row splits, that is an interesting idea, but
I
>>>> think once Kudu has support for range bounds (the non-covering range
>>>> partition design doc linked above), you can simply create the bounds where
>>>> the function would have put them.  For example, if you want a partition for
>>>> every five minutes, you can create the bounds accordingly.
>>>>
>>>> Earlier this week I gave a talk on timeseries in Kudu, I've included
>>>> some slides that may be interesting to you.  Additionally, you may want to
>>>> check out https://github.com/danburkert/kudu-ts, it's a very young
>>>>  (not feature complete) metrics layer on top of Kudu, it may give you some
>>>> ideas.
>>>>
>>>> - Dan
>>>>
>>>> On Sat, May 7, 2016 at 1:28 PM, Sand Stone <sand.m.stone@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for sharing, Dan. The diagrams explained clearly how the
>>>>> current system works.
>>>>>
>>>>> As for what's on my mind: take the schema <host,metric,time,...>.
>>>>> Say I am interested in data for the past 5 mins, 10 mins, etc., or
>>>>> aggregates at 5-min intervals for the past 3 days, 7 days, ... Looks
>>>>> like I need to introduce a special 5-min bar column and use that column
>>>>> to do range partitioning to spread data across the tablet servers, so
>>>>> that I could leverage parallel filtering.
>>>>>
>>>>> The cost of this extra column (INT8) is not ideal but not too bad
>>>>> either (storage-cost wise, compression should do wonders). So I am
>>>>> thinking whether it would be better to take "functions" as row splits
>>>>> instead of only constants. Of course, if the business requires dropping
>>>>> down to a 1-min bar, the data has to be re-sharded again. So a more
>>>>> cost-effective way of doing this on a production cluster would be good.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <dan@cloudera.com> wrote:
>>>>>
>>>>>> Hi Sand,
>>>>>>
>>>>>> I've been working on some diagrams to help explain some of the more
>>>>>> advanced partitioning types; it's attached.  Still pretty rough at this
>>>>>> point, but the goal is to clean it up and move it into the Kudu
>>>>>> documentation proper.  I'm interested to hear what kind of time series
>>>>>> you are interested in Kudu for.  I'm tasked with improving Kudu for time
>>>>>> series; you can follow progress here
>>>>>> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have any
>>>>>> additional ideas I'd love to hear them.  You may also be interested in a
>>>>>> small project that J-D and I have been working on in the past week to
>>>>>> build an OpenTSDB-style store on top of Kudu; you can find it here
>>>>>> <https://github.com/danburkert/kudu-ts>.  Still quite feature
>>>>>> limited at this point.
>>>>>>
>>>>>> - Dan
>>>>>>
>>>>>> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks. Will read.
>>>>>>>
>>>>>>> Given that I am researching time series data, row locality is
>>>>>>> crucial :-)
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <
>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>
>>>>>>>> We do have non-covering range partitions coming in the next few
>>>>>>>> months, here's the design (in review):
>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>>>>>>
>>>>>>>> The "Background & Motivation" section should give you a good idea
>>>>>>>> of why I'm mentioning this.
>>>>>>>>
>>>>>>>> Meanwhile, if you don't need row locality, using hash partitioning
>>>>>>>> could be good enough.
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sand.m.stone@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Makes sense.
>>>>>>>>>
>>>>>>>>> Yeah it would be cool if users could specify/control the split
>>>>>>>>> rows after the table is created. Now, I have to "think ahead" to
>>>>>>>>> pre-create the range buckets.
>>>>>>>>>
>>>>>>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <
>>>>>>>>> jdcryans@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> You will only get 1 tablet and no data distribution, which is bad.
>>>>>>>>>>
>>>>>>>>>> That's also how HBase works, but it will split regions as you
>>>>>>>>>> insert data, and eventually you'll get some data distribution even
>>>>>>>>>> if it doesn't start in an ideal situation. Tablet splitting will
>>>>>>>>>> come later for Kudu.
>>>>>>>>>>
>>>>>>>>>> J-D
>>>>>>>>>>
>>>>>>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <
>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> One more question: how does the range partition work if I don't
>>>>>>>>>>> specify the split rows?
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <
>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Misty. The "advanced" Impala example helped.
>>>>>>>>>>>>
>>>>>>>>>>>> I was just reading the Java API (CreateTableOptions.java); it's
>>>>>>>>>>>> unclear how the range partition column names are associated with
>>>>>>>>>>>> the partial-row params in the addSplitRow API.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>>>>>>> mstanleyjones@cloudera.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Sand,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please have a look at
>>>>>>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>>>>>>> and see if it is helpful to you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Misty
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <
>>>>>>>>>>>>> sand.m.stone@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I
>>>>>>>>>>>>>> know from some docs that this currently has to be done at
>>>>>>>>>>>>>> table creation. I am researching how to partition (hash+range)
>>>>>>>>>>>>>> some time series test data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there an example, or notes somewhere I could read up on?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks much.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
