hbase-user mailing list archives

From Angus He <angu...@gmail.com>
Subject Re: hbase evaluation questions
Date Wed, 14 Jul 2010 10:32:54 GMT
> 1) How can hbase be configured for a multi-tenancy model? What are the
> options to create a solid separation of data? In a relational database
> schemas would provide this and in cassandra the keyspace can provide the
> same. Of course we can add the tenancy key to the row key and create tenant
> specific tables/column families but that does not provide the same level of
> confidence of separation. We could also create separate clusters for each
> client, but then that defeats part of the point of going to a distributed
> database cluster to improve overall throughput+utilization across all
> clients. We currently run single MySQL databases for each of our clients
> (1-3 TBs each).

I was just wondering why you need to separate the analytical data into
different tables or HBase instances.
Is it for data reliability, or for security?
By the way, the Bigtable paper mentions that Google packed the data for
all web sites into two tables: a raw click table for end-user sessions
and a summary table for summary data.
We did the same, and it has worked well so far.

> 2) I am trying to model data within hbase and I am unable to truly model it
> as a column based data store due to the limitations of the API
> (hbase.thrift) in terms of getting back data for certain columns. I see
> information for defining a bloom filter which I believe could help speed up
> the retrieval of certain columns within a large row but the API does not
> seem to offer the ability to iterate through the columns. The API supports
> the ability to request a list of columns but no way that I have seen to scan
> columns for a given row key based on a start/stop column. This forces us to
> create a tall data model vs. a wide data model which in the end we think
> will hurt performance as more rows will be required.

HBase has no built-in support for column range scans.
But you can roll your own implementation on top of the
versatile HBase filter mechanism.
That said, you probably do not need column range scan support at all.
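
To make the filter idea concrete, here is a minimal sketch of the logic such
a filter would carry. It is plain Python, not the HBase filter API: the class
name, the decide() method, and the "skip"/"include"/"next_row" labels are all
hypothetical, and the zero-padded qualifiers are an assumption so that string
order matches numeric order.

```python
class QualifierRangeFilter:
    """Hypothetical filter: keep cells whose qualifier is in [start, stop)."""

    def __init__(self, start, stop):
        self.start = start
        self.stop = stop

    def decide(self, qualifier):
        # Cells before the range are passed over one by one.
        if qualifier < self.start:
            return "skip"
        # Qualifiers are sorted, so nothing after `stop` can match.
        if qualifier >= self.stop:
            return "next_row"
        return "include"


def apply_filter(row_cells, flt):
    """Walk one row's cells in qualifier order and apply the filter."""
    kept = {}
    for qualifier in sorted(row_cells):
        verdict = flt.decide(qualifier)
        if verdict == "next_row":
            break
        if verdict == "include":
            kept[qualifier] = row_cells[qualifier]
    return kept


# One wide row with ten columns (assumed zero-padded qualifiers).
row = {f"date-{n:04d}": n for n in range(1, 11)}
kept = apply_filter(row, QualifierRangeFilter("date-0003", "date-0006"))
```

Note that the cells before the start qualifier are still visited one by one;
the filter only decides what is returned.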

In my opinion, a tall table is more efficient.

1. A fat table probably needs to process more data to get the same result.

tall table:
row1: foobar-date-1
row2: foobar-date-2
...
row1000: foobar-date-1000

fat table:
row1: foobar   columns: date-1, date-2, ..., date-1000

Assume you want to retrieve the data from date-50 up to date-100.
With the tall table, just set the scan start key to foobar-date-50
and the end key to foobar-date-100; only 50 KeyValue items are touched.
With the fat table, you have to skip the first 49 columns (date-1
through date-49) and then stop at column date-100, so 100 KeyValue
items are involved. This will no longer hold if HBase supports a seek
operation some day.
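
The counting argument above can be reproduced with a small simulation. This
is only a sketch in plain Python over sorted lists, not HBase client code.
The helper names are made up, and the keys are zero-padded (an assumption)
because keys sort as bytes, so plain "date-50" would sort after "date-100".

```python
from bisect import bisect_left


def key(n):
    # Assumed fixed-width numbering so byte order equals numeric order.
    return f"date-{n:04d}"


# Tall layout: 1000 rows, one cell each, sorted by row key.
tall = sorted(f"foobar-{key(n)}" for n in range(1, 1001))

# Fat layout: a single row "foobar" with 1000 sorted column qualifiers.
wide = sorted(key(n) for n in range(1, 1001))


def tall_scan(rows, start, stop):
    """Jump straight to the start row, read until the stop row (exclusive)."""
    i = bisect_left(rows, start)
    returned = []
    while i < len(rows) and rows[i] < stop:
        returned.append(rows[i])
        i += 1
    return returned


def wide_scan_no_seek(columns, start, stop):
    """Walk the row from its first column; there is no seek inside a row."""
    touched = 0
    returned = []
    for c in columns:
        touched += 1  # every cell up to and including the stop column is read
        if c >= stop:
            break
        if c >= start:
            returned.append(c)
    return returned, touched


rows = tall_scan(tall, "foobar-" + key(50), "foobar-" + key(100))
cols, touched = wide_scan_no_seek(wide, key(50), key(100))
print(len(rows), len(cols), touched)
```

Both layouts return the same 50 values, but the fat-row walk reads 100 cells
to get there.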

2. More flexible granularity when parallel queries are employed.
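
As a rough illustration of that second point: with a tall table, a row key
range can be cut into sub-ranges and scanned concurrently, whereas a single
fat row is one unit. The sketch below is plain Python over an in-memory
sorted list, not HBase client code; scan(), split_range(), and
parallel_scan() are hypothetical helpers, and the zero-padded keys are an
assumption.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor

# Tall layout, already in sorted order thanks to the fixed-width numbers.
rows = [f"foobar-date-{n:04d}" for n in range(1, 1001)]


def scan(start, stop):
    """Stand-in for a range scan: rows with start <= key < stop."""
    return rows[bisect_left(rows, start):bisect_left(rows, stop)]


def split_range(first, last, parts):
    """Split day numbers [first, last) into contiguous (start, stop) key pairs."""
    step = -(-(last - first) // parts)  # ceiling division
    bounds = list(range(first, last, step)) + [last]
    return [
        (f"foobar-date-{a:04d}", f"foobar-date-{b:04d}")
        for a, b in zip(bounds, bounds[1:])
    ]


def parallel_scan(first, last, parts=4):
    """Scan each sub-range in its own worker and stitch the results in order."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        chunks = list(pool.map(lambda r: scan(*r), split_range(first, last, parts)))
    return [row for chunk in chunks for row in chunk]


result = parallel_scan(1, 1001, parts=4)
```

The number of parts is free to choose here, which is the flexibility the
point above refers to.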

> The data model is a std star schema in relational terms with a time
> dimension. Time is only down to the daily granularity and we would prefer to
> have this be part of the column key instead of the row key. From all
> examples I have seen time has always been added to the end of the row key to
> be accessed via row scans. In Cassandra for example time is modeled as a
> super column or column composite index and the API supports a range get
> against a set of columns within a single row.
> Any advice or pointers would be greatly appreciated. Thanks in advance!
> Wayne

