On Jun 2, 2009, at 1:31 AM, Jonathan Gray wrote:
> Ryan,
>
> You are currently only storing the latest nickname, not all 3? I'm trying
> to understand your use case exactly.
Yes, the multiple values are being stored, in fact far more than 3.
We've defined the tables to use the max number of versions. We
currently can store something to the effect of:
user123=>props:nickname:1243940086:Ryan
user123=>props:nickname:1243940087:Ryan McDonough
user123=>props:nickname:1243940088:Some guy asking questions
user123=>props:nickname:1243940089:Ryan
user123=>props:nickname:1243940090:Ryan
user123=>props:nickname:1243940091:
user123=>props:nickname:1243940092:Ryan McDonough
Where "props" is the column family. One thing that is challenging is
that because the versions are keyed by timestamp, you don't have a
mechanism to handle duplicate values, so it's possible to have the same
value repeated multiple times. Also, you don't have insight into whether
a value was the result of an insert, an accidental dupe, or a deletion.
Additionally, we can only evaluate a row filter against the most recent
column value, but IIRC, that's fixed in 0.20.
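
To make the duplicate problem concrete, here is a toy Python model of
the sample data above. It is only a sketch: `VersionedCell` and its
methods are names I made up for illustration, not the HBase API, and it
assumes the timestamp is the sole version key.

```python
from collections import Counter


class VersionedCell:
    """Toy stand-in for one row/column cell (hypothetical, not an HBase class)."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self._versions = {}  # timestamp -> value

    def put(self, timestamp, value):
        # The timestamp is the only version key, so writing an identical
        # value at a new timestamp simply creates another version.
        self._versions[timestamp] = value
        # Drop the oldest versions once the configured limit is exceeded.
        excess = len(self._versions) - self.max_versions
        if excess > 0:
            for ts in sorted(self._versions)[:excess]:
                del self._versions[ts]

    def get_versions(self):
        # Return (timestamp, value) pairs, most recent first.
        return sorted(self._versions.items(), reverse=True)


cell = VersionedCell(max_versions=10)
writes = [
    (1243940086, "Ryan"),
    (1243940087, "Ryan McDonough"),
    (1243940088, "Some guy asking questions"),
    (1243940089, "Ryan"),
    (1243940090, "Ryan"),
    (1243940091, ""),
    (1243940092, "Ryan McDonough"),
]
for ts, value in writes:
    cell.put(ts, value)

versions = cell.get_versions()
# How many times each value was stored.
counts = Counter(value for _, value in versions)
# The client has to de-duplicate by value itself, newest first.
distinct = list(dict.fromkeys(value for _, value in versions))
```

In this model all seven writes survive as separate versions, "Ryan"
shows up three times, and nothing tells the reader whether the empty
value was a deletion or just an empty write.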
>
> Whether you want to use versions or not depends on what you want to do
> with these multiple values.
>
> Versions are intended for versioning, as in, multiple values for the same
> column that are timestamped and sorted with most recent first.
Yes, I understand that part. But what I'm trying to clarify is why
store versions keyed only by timestamp and not by another arbitrary
value? As I mentioned in my initial question, I'm starting to see
versions as a way to provide some form of optimistic locking. To
quote the BigTable paper:
"Applications that need to avoid collisions must generate unique
timestamps themselves. Different versions of a cell are stored in
decreasing timestamp order, so that the most recent versions can be
read first. To make the management of versioned data less onerous, we
support two per-column-family settings that tell Bigtable to
garbage-collect cell versions automatically. The client can specify
either that only the last n versions of a cell be kept, or that only
new-enough versions be kept (e.g., only keep values that were written
in the last seven days)."
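
As I read that passage, the two garbage-collection settings amount to
something like the toy Python sketch below. The function names
`keep_last_n` and `keep_newer_than` are mine, not Bigtable's or HBase's,
and the plain dict of timestamp to value is an assumption made only for
illustration.

```python
def keep_last_n(versions, n):
    """Keep only the n most recent versions of a cell.

    `versions` maps timestamp -> value (an illustrative stand-in).
    """
    newest = sorted(versions, reverse=True)[:n]
    return {ts: versions[ts] for ts in newest}


def keep_newer_than(versions, now, max_age_seconds):
    """Keep only versions written within the last max_age_seconds."""
    cutoff = now - max_age_seconds
    return {ts: value for ts, value in versions.items() if ts >= cutoff}


nickname_versions = {
    1243940086: "Ryan",
    1243940087: "Ryan McDonough",
    1243940088: "Some guy asking questions",
    1243940089: "Ryan",
    1243940090: "Ryan",
    1243940091: "",
    1243940092: "Ryan McDonough",
}

# "Only the last n versions of a cell be kept", with n = 3.
last_three = keep_last_n(nickname_versions, 3)

# "Only new-enough versions be kept", here anything within 3 seconds of now.
recent = keep_newer_than(nickname_versions, now=1243940092, max_age_seconds=3)
```

Either policy discards values purely by recency, which is part of why
using versions as a general-purpose data dimension makes me uneasy.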
With that said, I'm just trying to get some clarity on how HBase
utilizes versions internally and whether there's any chance of seeing
unintended consequences from using versions for something other than
versioning. For example, does having multiple versions add additional
overhead at compaction time or when region splits occur?
To put it another way: Based on my current understanding of HBase
versions, I could equate it to using an audit schema in an RDBMS to
join multiple values. While it's possible, it's not what you'd use an
audit schema for.
> It seems from what you said that versions will work nicely. With the new
> API in the upcoming 0.20, there is much better support dealing with
> multiple versions.
Yes, it does work quite nicely; however, I just feel like something's
wrong with our design. Thanks for the response.
Ryan-
>
> JG
>
> On Mon, June 1, 2009 6:10 pm, Ryan J. McDonough wrote:
>> I'm trying to get some clarity on the role of versions in HBase. Our
>> table design is such that an object can have multiple property values
>> for a given property name. For example, we could have a nickname
>> property that a given person is known by. In the current setup, if a
>> person has 3 nicknames, only the last one gets stored. We have
>> considered using the column versions as an added data dimension, but
>> that just doesn't feel quite right. Given that columns have a limit
>> (granted that it's quite large) as to how many versions they can store,
>> it's still a limit nonetheless.
>>
>> From what I gather from reading the BigTable doc, versions could be
>> considered a form of optimistic locking so that concurrent writes
>> don't conflict. Is that understanding correct? If not, is using versions
>> as an added data dimension a good idea?
>>
>> Ryan-
>>
>>
>>
>