hbase-user mailing list archives

From Andrey Stepachev <oct...@gmail.com>
Subject Re: Disk Seeks and Column families
Date Tue, 24 Jan 2012 06:51:27 GMT
2012/1/24 Praveen Sripati <praveensripati@gmail.com>:
> Thanks for the response. I am just getting started with HBase. And before
> getting into the code/api level details, I am trying to understand the
> problem area HBase is trying to address through its architecture/design.
>
> 1) So, what are the recommendations for having many columns and with dense
> data? Is HBase not the right tool?

Split them by prefixing the keys (i.e. key -> a,b,c becomes a_key, b_key, c_key)
and aggregate them as independent values, if possible.
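
A minimal sketch of that key-prefixing idea, in plain Python rather than
the HBase client API. The function name split_row and the sample data are
hypothetical, used only for illustration. It shows that once the column name
leads the key, all values of one column sort next to each other:

```python
def split_row(key, columns):
    """Turn one wide row into one narrow row per column (key -> a_key, b_key, ...)."""
    return {f"{name}_{key}": value for name, value in columns.items()}

# Two wide rows, each with columns a, b and c (made-up sample data).
table = {}
table.update(split_row("row-550", {"a": 1, "b": 2, "c": 3}))
table.update(split_row("row-551", {"a": 4, "b": 5, "c": 6}))

# HBase keeps rows sorted by key, so sorting here mimics the on-disk order:
# every "a_" row comes before any "b_" row, and so on.
ordered_keys = sorted(table)

# Aggregating column "a" now touches one contiguous key range.
total_a = sum(v for k, v in table.items() if k.startswith("a_"))
```

The trade-off is that reading a whole original row back needs one lookup
per column prefix.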

>
> 2) Also, if the data for a column is spread wide across blocks and maybe
> even across nodes how will HBase help in aggregation?

Think about your data layout and optimize it for your tasks. HBase is not an
RDBMS. You should plan your schema in the way that best suits your queries.

>
> 3) Also, about storing data using an incremental row key, initially there
> will be a hot spot with the data getting to a single region. Even after a
> split of the region into two, the first one won't be getting any data (in
> incremental row key) and the second one will be hammered.

a) As in 1), add something to the key, for example per 5-minute interval.
Later you can issue 16 queries and merge them (for realtime).
b) If this data is for MapReduce, you can use a key of day + md5(time); a later
MR task then collects all the data in the right place for aggregation.
(As usual, you must trade off write speed against query speed.)
c) Split your incoming data by another field, for example host or metric.
You can look at the data model of http://opentsdb.net/
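
A rough sketch of options a) and b), in plain Python rather than the HBase
client API. It rests on assumptions: the "something" added to the key is read
here as a one-hex-digit bucket prefix (hence 16 queries), the bucket is
derived from a hypothetical event id, and all function names and the key
layout are made up for illustration. The scans are simulated over an
in-memory list of keys:

```python
import hashlib
import heapq

BUCKETS = 16  # assumption: one bucket per hex digit, hence 16 queries on read

def salted_key(timestamp, event_id):
    """Option a): prefix the key with a hash-derived bucket to spread writes."""
    bucket = int(hashlib.md5(event_id.encode()).hexdigest(), 16) % BUCKETS
    return f"{bucket:x}_{timestamp:013d}_{event_id}"

def strip_salt(key):
    """Drop the bucket prefix so keys compare by timestamp again."""
    return key.split("_", 1)[1]

def merged_scan(table_keys, start_ts, stop_ts):
    """Simulate one range scan per bucket, then merge the results in time order."""
    per_bucket = []
    for b in range(BUCKETS):
        lo = f"{b:x}_{start_ts:013d}"
        hi = f"{b:x}_{stop_ts:013d}"  # stop is exclusive
        per_bucket.append(sorted(k for k in table_keys if lo <= k < hi))
    return list(heapq.merge(*per_bucket, key=strip_salt))

def day_hashed_key(day, timestamp):
    """Option b): day + md5(time); time order inside a day is deliberately lost."""
    return f"{day}_{hashlib.md5(str(timestamp).encode()).hexdigest()}"

# Made-up events as (timestamp, event id) pairs.
events = [(100, "a"), (150, "b"), (120, "c"), (300, "d")]
keys = [salted_key(ts, eid) for ts, eid in events]
in_order = [strip_salt(k) for k in merged_scan(keys, 100, 200)]
```

Writes land in up to 16 different key ranges instead of one, and a
time-ordered read pays for that with 16 scans plus a merge.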

>
> One of the approaches to alleviate this is not to insert incremental row keys
> from the client and have the row keys scattered for better load balancing.
> But, this approach is not efficient if I want to get events in a time
> sequence, in which case I have to use some filters to scan the entire data.
>
> 4) Still not clear why I can't have 10 column families in HBase and why
> only 2 or 3 of them according to this link (1)?

You can, but:
a) You should tune a bunch of parameters
(hbase.hregion.memstore.block.multiplier,
hbase.hstore.blockingStoreFiles and others)
to get it to work under high write load. Even so, because of how the
memstore and splits are designed, fewer families perform better.
b) You can write a small benchmark and see that 2 families are significantly
faster than 10.


>
> (1) - http://hbase.apache.org/book/number.of.cfs.html
>
> Praveen
>
> On Sun, Jan 22, 2012 at 12:02 PM, M. C. Srivas <mcsrivas@gmail.com> wrote:
>
>> Praveen,
>>
>>  basically you are correct on all counts. If there are too many columns,
>>  HBase will have to issue more disk-seeks  to extract only the particular
>> columns you need ... and since the data is laid out horizontally there are
>> fewer common substrings in a single HBase-block and compression quality
>> starts to degrade due to reduced redundancy.
>>
>>
>> On Sat, Jan 21, 2012 at 9:49 AM, Praveen Sripati
>> <praveensripati@gmail.com> wrote:
>>
>> > Thanks for the response.
>> >
>> > > The contents of a row stay together like a regular row-oriented
>> database.
>> >
>> > > K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
>> > > K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
>> > > K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
>> > > K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
>> > > K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
>> >
>> > Is the above statement true for a HFile?
>> >
>> > Also from the above example, the data for the column family qualifier are
>> > not adjacent to take advantage of compression (
>> > http://en.wikipedia.org/wiki/Column-oriented_DBMS#Compression). Is this
>> a
>> > proper statement?
>> >
>> > Regards,
>> > Praveen
>> >
>> > On Sat, Jan 21, 2012 at 9:03 PM, <yuzhihong@gmail.com> wrote:
>> >
>> > > Have you considered using AggregationProtocol to perform aggregation ?
>> > >
>> > > Thanks
>> > >
>> > >
>> > >
>> > > On Jan 20, 2012, at 11:08 PM, Praveen Sripati <
>> praveensripati@gmail.com>
>> > > wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > 1) According to this url (1), HBase performs well for two or
>> three
>> > > > column families. Why is it so?
>> > > >
>> > > > 2) Dump of a HFile, looks like below. The contents of a row stay
>> > together
>> > > > like a regular row-oriented database. If the column family has 100
>> > column
>> > > > family qualifiers and is dense then the data for a particular column
>> > > family
>> > > > qualifier is spread wide. If I want to do an aggregation on a
>> > particular
>> > > > column identifier, the disk seeks don't seem to be much better
>> than
>> > a
>> > > > regular row-oriented database.
>> > > >
>> > > > Please correct me if I am wrong.
>> > > >
>> > > > K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
>> > > > K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
>> > > > K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
>> > > > K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
>> > > > K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
>> > > >
>> > > > (1) - http://hbase.apache.org/book/number.of.cfs.html
>> > > >
>> > > > Thanks,
>> > > > Praveen
>> > >
>> >
>>



-- 
Andrey.
