hbase-user mailing list archives

From Wayne <wav...@gmail.com>
Subject hbase evaluation questions
Date Wed, 14 Jul 2010 08:25:51 GMT
I am trying to evaluate HBase for use as an analytical data store, and I
have a few questions I have not been able to answer from the wiki or from
googling in general.

1) How can HBase be configured for a multi-tenancy model? What are the
options for creating a solid separation of data? In a relational database,
schemas would provide this, and in Cassandra a keyspace provides the same.
Of course we can add the tenancy key to the row key and create
tenant-specific tables/column families, but that does not give the same
level of confidence in the separation. We could also create a separate
cluster for each client, but that defeats part of the point of moving to a
distributed database cluster, which is to improve overall throughput and
utilization across all clients. We currently run a single MySQL database
for each of our clients (1-3 TB each).
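
To make the row-key option concrete, here is a minimal sketch in plain
Python of what I mean by adding the tenancy key to the row key. The helper
names and the separator byte are my own assumptions for illustration and are
not part of any HBase API; the sketch only shows how a tenant prefix keeps
one tenant's rows contiguous in sorted key order.

```python
# Hypothetical helpers for illustration only; nothing here calls HBase.
SEP = b"\x00"  # assumed separator byte that never occurs in a tenant id


def tenant_row_key(tenant_id, entity_key):
    """Prefix the row key with the tenant id so a tenant's rows sort together."""
    tenant = tenant_id.encode("utf-8")
    if SEP in tenant:
        raise ValueError("tenant id must not contain the separator byte")
    return tenant + SEP + entity_key.encode("utf-8")


def tenant_scan_range(tenant_id):
    """Return (start_row, stop_row) bounding one tenant's rows.

    start_row is inclusive and stop_row is exclusive, which is the usual
    convention for a sorted-key range scan.
    """
    tenant = tenant_id.encode("utf-8")
    # Every key of this tenant begins with tenant + b"\x00", so
    # tenant + b"\x01" is the smallest key that sorts after all of them.
    return tenant + SEP, tenant + b"\x01"


if __name__ == "__main__":
    rows = sorted([
        tenant_row_key("acme", "order-1"),
        tenant_row_key("acme2", "order-1"),
        tenant_row_key("acm", "order-9"),
    ])
    start, stop = tenant_scan_range("acme")
    print([r for r in rows if start <= r < stop])
```

This keeps tenants apart only by convention in the key, which is exactly why
it does not feel like real separation to me.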

2) I am trying to model data within HBase, and I am unable to truly model it
as a column-based data store because of limitations in the API
(hbase.thrift) for getting back data for certain columns. I see
information on defining a bloom filter, which I believe could help speed up
the retrieval of certain columns within a large row, but the API does not
seem to offer a way to iterate through the columns. The API supports
requesting a list of columns, but I have not found a way to scan the
columns of a given row key based on a start/stop column. This forces us to
create a tall data model instead of a wide data model, which we think will
hurt performance in the end because more rows will be required.

The data model is a standard star schema in relational terms, with a time
dimension. Time only goes down to daily granularity, and we would prefer to
have it be part of the column key instead of the row key. In all the
examples I have seen, time has been added to the end of the row key so that
it can be accessed via row scans. In Cassandra, for example, time is modeled
as a super column or column composite index, and the API supports a range
get against a set of columns within a single row.

Any advice or pointers would be greatly appreciated. Thanks in advance!

