hbase-user mailing list archives

From Bill de hOra <b...@dehora.net>
Subject Re: usecase: tagged key/values
Date Thu, 12 Feb 2009 20:32:20 GMT
Jonathan Gray wrote:
> Bill,
> 
> It's hard to say whether HBase is a good fit without knowing a bit more.
> 
> HBase is very well suited for storing data in the format you describe.  If
> your primary problem is scaling the persistence of this dataset, it can
> certainly do that.  You can have any number of arbitrary key/vals for each
> row, and any number of rows.  The example you show looks almost exactly
> like an HBase schema.
> 
> Your row key would be "8904830324" and you would have a single family that
> contained a column per key/val.  The column name is the key, the column
> value is the val.  You could have one key/val in one row and 1000 in
> another row; this schema is not at all fixed.
> 
> But I really need to better understand the expected dimensions of your
> dataset and how you'd like to query it to know if that's the right schema.
> 
> Do you expect very high numbers of key/vals per identifier?  10, 100,
> 1000, more?  

I'd say in the range of 5-20. The number of identifiers is at least tens 
of millions.
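
For what it's worth, a minimal sketch of the schema Jonathan describes, 
using the Java client (the "identifiers" table and "kv" family names are 
made up, and the exact client API varies by version):

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TaggedKeyValues {
        public static void main(String[] args) throws IOException {
            try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("identifiers"))) {
                // The row key is the identifier; each key/val pair becomes a
                // column in the single "kv" family. Rows can carry any number
                // of columns, so 5 pairs in one row and 20 in another is fine.
                Put put = new Put(Bytes.toBytes("8904830324"));
                put.addColumn(Bytes.toBytes("kv"), Bytes.toBytes("foo"), Bytes.toBytes("10"));
                put.addColumn(Bytes.toBytes("kv"), Bytes.toBytes("bar"), Bytes.toBytes("red"));
                table.put(put);
            }
        }
    }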


> And would they be consistent across the identifiers (within a
> deployment, or table in this case) or would they vary greatly between
> rows?  

Reasonably consistent; not every identifier will have all the values.


> Also, are you going to be querying this in realtime and concurrently? 
> Will you be storing lots of data and processing it in batch?  Are you
> write heavy or read heavy?

Read-dominated; easily 80-85% of calls. The calls are realtime, but I 
have the option to cache that data heavily.


> As you can see, you have to think carefully about how you're going to be
> inserting and querying the data to determine how best to store it.  I'm
> looking forward to hearing more details because it sounds like an
> interesting (and potentially common) problem to solve.

So in this case each identifier is a user or a community key; as I said, 
those are in the tens of millions. They have some arbitrary key/values 
associated with them, 10-20 each, but typically the keys are common 
across users. In some cases there's a need to do a reverse lookup by a 
key's value, e.g. "find all users where foo=10", but the keys queried 
that way are again a subset of the total key set.
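
The naive way to express that reverse lookup is a filtered scan; a rough 
sketch, reusing the made-up table and family from above and assuming the 
client's SingleColumnValueFilter:

    import org.apache.hadoop.hbase.CompareOperator;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    // "find all users where foo=10" as a server-side filtered scan.
    // 'table' is the Table handle from the sketch above.
    SingleColumnValueFilter filter = new SingleColumnValueFilter(
            Bytes.toBytes("kv"), Bytes.toBytes("foo"),
            CompareOperator.EQUAL, Bytes.toBytes("10"));
    filter.setFilterIfMissing(true); // skip rows that have no "foo" column
    Scan scan = new Scan();
    scan.setFilter(filter);
    try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
            System.out.println(Bytes.toString(row.getRow()));
        }
    }

That still touches every row, though, so at tens of millions of rows a 
second table keyed by value (something like "foo=10/<identifier>" as the 
row key) is probably the saner pattern for the hot keys.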

Another use case is being able to store semi-structured data for media, 
e.g. EXIF values or a controlled set of tags. Again there aren't that many 
keys, but the media count is big: hundreds of millions of items.

In both cases reads outstrip writes, probably 10 to 1. In the media 
case, most writes are new data being inserted. It's the kind of data that 
in RDBMSes winds up in extension tables, which become harder to manage 
as they get bigger.

Bill
