On Jun 2, 2009, at 1:31 AM, Jonathan Gray wrote:

> Ryan,
>
> You are currently only storing the latest nickname, not all 3? I'm trying
> to understand your use case exactly.

Yes, the multiple values are being stored, in fact far more than 3. We've defined the tables to use the max number of versions. We can currently store something to the effect of:

    user123=>props:nickname:1243940086:Ryan
    user123=>props:nickname:1243940087:Ryan McDonough
    user123=>props:nickname:1243940088:Some guy asking questions
    user123=>props:nickname:1243940089:Ryan
    user123=>props:nickname:1243940090:Ryan
    user123=>props:nickname:1243940091:
    user123=>props:nickname:1243940092:Ryan McDonough

where "props" is the column family. One challenge is that, because the versions are keyed by timestamp, there is no mechanism to handle duplicate values, so the same value can be repeated multiple times. Also, you have no insight into whether a value was the result of an insert, an accidental dupe, or a deletion. Additionally, a row filter can only be evaluated against the most recent column value, but IIRC that's fixed in 0.20.

> Whether you want to use versions or not depends on what you want to do
> with these multiple values.
>
> Versions are intended for versioning, as in, multiple values for the same
> column that are timestamped and sorted with most recent first.

Yes, I understand that part. What I'm trying to clarify is why versions are keyed only by timestamp and not by some other arbitrary value. As I mentioned in my initial question, I'm starting to see versions as a way to provide some form of optimistic locking. To quote the BigTable paper:

"Applications that need to avoid collisions must generate unique timestamps themselves. Different versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read first.
To make the management of versioned data less onerous, we support two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically. The client can specify either that only the last n versions of a cell be kept, or that only new-enough versions be kept (e.g., only keep values that were written in the last seven days)."

With that said, I'm just trying to get some clarity on how HBase utilizes versions internally, and whether there's any chance of seeing unintended consequences from using versions for something other than versioning. For example, does having multiple versions add overhead at compaction time or when region splits occur?

To put it another way: based on my current understanding of HBase versions, I could equate this to using an audit schema in an RDBMS to join multiple values. While it's possible, it's not what you'd use an audit schema for.

> It seems from what you said that versions will work nicely. With the new
> API in the upcoming 0.20, there is much better support dealing with
> multiple versions.

Yes, it does work quite nicely; however, I just feel like something's wrong with our design. Thanks for the response.

Ryan-

> JG
>
> On Mon, June 1, 2009 6:10 pm, Ryan J. McDonough wrote:
>> I'm trying to get some clarity on the role of versions in HBase. Our
>> table design is such that an object can have multiple property values for
>> a given property name. For example, we could have a nickname property
>> that a given person is known by. In the current set up, if a person has 3
>> nicknames, only the last one gets stored. We have considered using the
>> column versions as an added data dimension, but that just doesn't feel
>> quite right. Given that columns have a limit (granted that it's quite
>> large) as to how many versions it can store, it's still a limit none the
>> less.
>>
>> From what I gather from reading the BigTable doc, is that version
>> could be considered a form of optimistic locking so that concurrent writes
>> don't conflict. Is that understanding correct? If not, is using versions
>> as an added data dimension a good idea?
>>
>> Ryan-
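
P.S. To make the duplicate-value concern concrete, here is a minimal sketch in Python of a cell whose versions are keyed only by timestamp. It is a toy in-memory model, not the HBase API: the class `VersionedCell`, its methods, and the `max_versions` value of 10 are all hypothetical names and numbers chosen for illustration. The timestamps and values are the ones from the example in this message.

```python
class VersionedCell:
    """Toy in-memory model of one cell (hypothetical, NOT the HBase API).

    Versions are keyed only by timestamp and read back newest first.
    """

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self._versions = {}  # timestamp -> value

    def put(self, timestamp, value):
        # A write at an existing timestamp overwrites that version;
        # nothing checks whether the value itself is already stored.
        self._versions[timestamp] = value
        # Keep only the newest max_versions entries.
        for ts in sorted(self._versions, reverse=True)[self.max_versions:]:
            del self._versions[ts]

    def get(self, n=1):
        # Return up to n (timestamp, value) pairs, most recent first.
        newest_first = sorted(self._versions, reverse=True)
        return [(ts, self._versions[ts]) for ts in newest_first[:n]]

    def distinct_values(self):
        # De-duplication has to happen on the client side.
        seen = []
        for _, value in self.get(len(self._versions)):
            if value not in seen:
                seen.append(value)
        return seen


cell = VersionedCell(max_versions=10)  # 10 is an arbitrary choice
writes = [
    (1243940086, "Ryan"),
    (1243940087, "Ryan McDonough"),
    (1243940088, "Some guy asking questions"),
    (1243940089, "Ryan"),
    (1243940090, "Ryan"),
    (1243940091, ""),
    (1243940092, "Ryan McDonough"),
]
for ts, value in writes:
    cell.put(ts, value)

print(cell.get(1))             # only the newest version
print(len(cell.get(10)))       # all stored versions, duplicates included
print(cell.distinct_values())  # what the client has to compute itself
```

In this model all seven writes survive as separate versions even though only four distinct values were written, and the empty value at 1243940091 cannot be told apart from a deletion. That is the behavior that makes me question using versions as a data dimension.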