hbase-user mailing list archives

From "Ryan J. McDonough" <r...@damnhandy.com>
Subject Re: Clarifying the role of HBase Versions
Date Tue, 02 Jun 2009 11:16:02 GMT

On Jun 2, 2009, at 1:31 AM, Jonathan Gray wrote:

> Ryan,
> You are currently only storing the latest nickname, not all 3?  I'm trying
> to understand your use case exactly.

Yes, the multiple values are being stored, in fact far more than 3.  
We've defined the tables to use the max number of versions. We  
currently can store something to the effect of:

user123=>props:nickname:1243940087:Ryan McDonough
user123=>props:nickname:1243940088:Some guy asking questions
user123=>props:nickname:1243940092:Ryan McDonough

Where "props" is the column family. One challenge is that, because
versions are keyed by timestamp, there is no mechanism for handling
duplicate values, so the same value can be repeated multiple times.
Also, you have no insight into whether a value was the result of an
insert, an accidental dupe, or a deletion. Additionally, we can only
evaluate a row filter against the most recent column value, but IIRC
that's fixed in 0.20.
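
To make the duplicate-value issue concrete, here is a minimal sketch in
plain Python. It is not the HBase client API: VersionedCell and
distinct_values are hypothetical names, and the model only assumes that
versions are keyed by timestamp and read back newest first.

```python
# Hypothetical in-memory stand-in for one versioned cell; not the HBase API.
class VersionedCell:
    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._versions = {}  # timestamp -> value

    def put(self, timestamp, value):
        # In this model a write at an existing timestamp replaces that
        # version; a write at a new timestamp adds one, even if the value
        # is a repeat of an older one.
        self._versions[timestamp] = value
        # Keep only the newest max_versions entries.
        for ts in sorted(self._versions, reverse=True)[self.max_versions:]:
            del self._versions[ts]

    def get(self, n=1):
        # Versions come back newest first.
        return sorted(self._versions.items(), reverse=True)[:n]


def distinct_values(versions):
    # Client-side de-duplication: keep the newest timestamp for each value.
    seen = {}
    for ts, value in versions:
        seen.setdefault(value, ts)
    return list(seen.items())


cell = VersionedCell(max_versions=10)
cell.put(1243940087, "Ryan McDonough")
cell.put(1243940088, "Some guy asking questions")
cell.put(1243940092, "Ryan McDonough")

all_versions = cell.get(n=10)            # three versions, one value repeated
nicknames = distinct_values(all_versions)  # two distinct nicknames
```

The cell happily keeps "Ryan McDonough" twice, so any notion of a set of
distinct values has to be rebuilt by the reader.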

> Whether you want to use versions or not depends on what you want to do
> with these multiple values.
> Versions are intended for versioning, as in, multiple values for the  
> same
> column that are timestamped and sorted with most recent first.

Yes, I understand that part. What I'm trying to clarify is why
versions are keyed only by timestamp and not by some other arbitrary
value. As I mentioned in my initial question, I'm starting to see
versions as a way to provide a form of optimistic locking. To
quote the BigTable paper:

"Applications that need to avoid collisions must generate unique
timestamps themselves. Different versions of a cell are stored in
decreasing timestamp order, so that the most recent versions can be
read first. To make the management of versioned data less onerous, we
support two per-column-family settings that tell Bigtable to
garbage-collect cell versions automatically. The client can specify
either that only the last n versions of a cell be kept, or that only
new-enough versions be kept (e.g., only keep values that were written
in the last seven days)."
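
The two garbage-collection settings described in the quote can be
sketched as follows. This is only an illustration in plain Python, not
Bigtable or HBase code; keep_last_n and keep_newer_than are hypothetical
helper names, and the timestamps are assumed to be seconds since the
epoch.

```python
def keep_last_n(versions, n):
    # versions: list of (timestamp, value) pairs.
    # Returns the n newest versions, newest first.
    return sorted(versions, reverse=True)[:n]


def keep_newer_than(versions, now, max_age_seconds):
    # Returns only versions written within max_age_seconds of now,
    # newest first.
    cutoff = now - max_age_seconds
    return [(ts, v) for ts, v in sorted(versions, reverse=True) if ts >= cutoff]


SEVEN_DAYS = 7 * 24 * 60 * 60
now = 1243940100  # assumed "current time" for the example

versions = [
    (1243940087, "Ryan McDonough"),
    (1243940088, "Some guy asking questions"),
    (1243940092, "Ryan McDonough"),
    (now - SEVEN_DAYS - 1, "Old nickname"),  # just past the age limit
]

last_two = keep_last_n(versions, 2)
recent = keep_newer_than(versions, now, SEVEN_DAYS)
```

Either policy silently discards older versions, which matters if the
versions are being used as data rather than as history.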

With that said, I'm just trying to get some clarity on how HBase
uses versions internally, and whether there's any chance of seeing
unintended consequences from using versions for something other than
versioning. For example, does having multiple versions add
overhead at compaction time or when region splits occur?

To put it another way: based on my current understanding of HBase
versions, I could equate it to using an audit schema in an RDBMS to
join multiple values. While it's possible, it's not what you'd use an
audit schema for.

> It seems from what you said that versions will work nicely.  With the new
> API in the upcoming 0.20, there is much better support dealing with
> multiple versions.

Yes, it does work quite nicely; however, I just feel like something's
wrong with our design. Thanks for the response.


> JG
> On Mon, June 1, 2009 6:10 pm, Ryan J. McDonough wrote:
>> I'm trying to get some clarity on the role of versions in HBase. Our
>> table design is such that an object can have multiple property values
>> for a given property name. For example, we could have a nickname
>> property that a given person is known by. In the current setup, if a
>> person has 3 nicknames, only the last one gets stored. We have
>> considered using the column versions as an added data dimension, but
>> that just doesn't feel quite right. Given that columns have a limit
>> (granted, it's quite large) on how many versions they can store, it's
>> still a limit nonetheless.
>> From what I gather from reading the BigTable doc, versions could be
>> considered a form of optimistic locking so that concurrent writes
>> don't conflict. Is that understanding correct? If not, is using
>> versions as an added data dimension a good idea?
>> Ryan-
