hbase-user mailing list archives

From Tatsuya Kawano <tatsuy...@snowcocoa.info>
Subject Re: Schema questions: Best practices, versions/timestamps
Date Tue, 10 Nov 2009 19:43:34 GMT
Hi Lars,

On Mon, Nov 9, 2009 at 11:59 PM, Lars Francke <lars.francke@gmail.com> wrote:
> I've read numerous threads on this mailing list and I've asked several
> times on IRC but the answers I get are rarely the same so I'd like to
> try once more.

I don't think this question has a single right or wrong answer. HBase
gives you several ways to do this, and that's why people give you
different suggestions.

> I have a data model that would be a perfect match for the
> versions/timestamps that are available in HBase. Some say that it is
> perfectly feasible to use the versions as another "data dimension" and
> some say that it isn't meant to be used that way at all. The BigTable
> paper doesn't go into very much detail about this but from what I
> gathered it is indeed used as an additional dimension.

I think the same way as you do; versions can be used as an additional dimension.

[ The model that uses the versions as a data dimension ]

> In my data model the versions would start at 1 and be ascending - no
> timestamps but HBase doesn't enforce those.

I'm not sure I understand you correctly. A row doesn't have a version
value in its key, only the user-specified id and a timestamp. Please see
the KeyValue section of this wonderful blog post by Lars George.


> The upside of this model
> would be that only the difference between two versions would have to
> be saved and that I'd be provided with a nice API to handle versions.


[ The model that does not use the versions but compound row key ]

> The model proposed to me numerous times using a compound row key
> (model id:version) would save duplicates of the data (or I'd have to
> handle the diffs myself).

That's right.

> Another upside would be that it would
> require only a Get to get an element and its history.

I don't think this is accurate. To get its history at once, you will
not use Get but Scan with a prefix key (model id). Also, with the
earlier model, you can still get its history with a single Get (Get
has #setMaxVersions(int)). So, both models can do this.

I think the upside of the latter model (compound row key) is that you
can get a specific version very quickly because the version value is a
part of the key. The earlier model requires you to iterate through the
whole history and check each timestamp to find the right version.
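
A minimal sketch of the compound row key idea, using a plain
java.util.TreeMap as a stand-in for an HBase table (this is not HBase
client code). The key layout "<model id>:<zero-padded version>", the
class name and the helper names are assumptions made for illustration.
A sorted map lookup plays the role of a Get on one version, and a
sub-map over the key prefix plays the role of a prefix Scan.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class CompoundKeySketch {
    // Hypothetical key layout: "<modelId>:<version padded to 10 digits>".
    // Zero padding keeps lexicographic key order equal to numeric version order.
    static String rowKey(String modelId, int version) {
        return String.format("%s:%010d", modelId, version);
    }

    // Stand-in for a prefix Scan: every row whose key starts with "<modelId>:".
    static NavigableMap<String, String> history(NavigableMap<String, String> table,
                                                String modelId) {
        String start = modelId + ":";
        String stop = modelId + ";"; // ';' is the ASCII character right after ':'
        return table.subMap(start, true, stop, false);
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        table.put(rowKey("m1", 2), "second");
        table.put(rowKey("m1", 1), "first");
        table.put(rowKey("m2", 1), "other model");

        // One specific version: a direct lookup by key, no iteration needed.
        System.out.println(table.get(rowKey("m1", 2)));
        // Whole history of one model: a range over the shared key prefix.
        System.out.println(history(table, "m1").keySet());
    }
}
```

The sketch only models the sort order of row keys; it says nothing
about how HBase stores or serves the rows.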

> I require "out of order" insertion to the versions and I was told that
> this is probably no problem as long as I don't delete a version. Is
> this true?

I don't have the answer. You might want to try it yourself.

> I know that there is a limit for versions (Integer.MAX_VALUE as far as
> I can see) and for some of my tables this will be a problem so I'd end
> up using a mix of both these models anyway but if possible I'd like to
> use the version model provided by HBase where I can. I haven't seen a
> single example schema, tutorial, ... that talks about the versions in
> schemas; they seem to go mainly unused.

I couldn't find examples that retrieve a column value with a specific
timestamp, and the 0.20.x API doesn't seem to have any convenience
methods to do this. You'll have to call Result#sorted() to get sorted
KeyValues, or Result#getMap() to get NavigableMaps. Then you'll
iterate through one of them to find a specific column with a specific
timestamp.
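
As a rough sketch of that iteration, the innermost map returned by
Result#getMap() maps timestamps to values, newest first. The code below
models only that innermost map with java.util classes; the class and
method names are hypothetical, and no HBase client call is made.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class VersionLookupSketch {
    // Builds an empty timestamp -> value map ordered newest-first,
    // mirroring the order of the innermost map of Result#getMap().
    static NavigableMap<Long, byte[]> newVersionMap() {
        return new TreeMap<>(Comparator.<Long>reverseOrder());
    }

    // Walks the versions newest-first and returns the value stored at
    // exactly the given timestamp, or null if no such version exists.
    static byte[] valueAt(NavigableMap<Long, byte[]> versions, long timestamp) {
        for (Map.Entry<Long, byte[]> e : versions.entrySet()) {
            long ts = e.getKey();
            if (ts == timestamp) {
                return e.getValue();
            }
            if (ts < timestamp) {
                break; // every remaining entry is older, so stop early
            }
        }
        return null;
    }

    public static void main(String[] args) {
        NavigableMap<Long, byte[]> versions = newVersionMap();
        // Versions used as small ascending numbers (1, 2, 3) instead of wall-clock times.
        versions.put(1L, new byte[] {10});
        versions.put(3L, new byte[] {30});
        versions.put(2L, new byte[] {20});

        byte[] v2 = valueAt(versions, 2L);
        System.out.println(v2 == null ? "missing" : String.valueOf(v2[0]));
    }
}
```

Because the map is a NavigableMap, calling get(timestamp) on it would
also work for an exact match; the loop is shown because it mirrors the
iteration described above.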

> So my question would be: Should I use versions as an important part of
> my schema or not? If not are there any tips/hints on management of
> versions using compound keys and what the versions/timestamps are used
> for if not as an additional data dimension?

It depends on how often you will search for a specific version of a
record. If you do this very often, I think the latter model (compound
row key) will be easier to work with. Otherwise, the earlier model
(use versions) can be an option.

> And one more question about a "proper" schema: I have quite a lot of
> places that merely save a list of things it relates to without
> requiring any additional information (Many-to-Many). I'd have
> introduced a new column family and used the columns as keys to another
> table but I won't need the column value. How does HBase behave in
> regard to "null" as a column value? The FAQ entry about this topic is
> a bit unclear. Or is this the wrong way to begin with?

I believe you can't literally give a "null" to a column, so use an
empty (zero-length) byte array instead. Since it's a zero-length
array, the value itself doesn't take up any disk space (the key of
the cell is still stored).
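
A small sketch of the "column name as foreign key, empty value" idea,
again modeled with java.util classes rather than the HBase client. The
class, the helper names and the sample keys are made up for
illustration. With the real client, the same zero-length byte[] would
be passed as the value when adding the column to a Put.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class RelationColumnsSketch {
    // Shared zero-length value: the column's presence is the only information stored.
    static final byte[] EMPTY = new byte[0];

    // Records "this row relates to targetRowKey" by using the target's
    // row key as the column name inside a relation column family.
    static void relate(NavigableMap<String, byte[]> relationFamily, String targetRowKey) {
        relationFamily.put(targetRowKey, EMPTY);
    }

    // A relation exists if and only if the column exists; the value is never read.
    static boolean isRelated(NavigableMap<String, byte[]> relationFamily, String targetRowKey) {
        return relationFamily.containsKey(targetRowKey);
    }

    public static void main(String[] args) {
        // Models one row's relation column family: column name -> value.
        NavigableMap<String, byte[]> relations = new TreeMap<>();
        relate(relations, "way:17");
        relate(relations, "way:42");

        System.out.println(isRelated(relations, "way:42"));
        System.out.println(relations.keySet());
    }
}
```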

Hope this helps,

Tatsuya Kawano (Mr.)
Tokyo, Japan
