hbase-user mailing list archives

From Wayne <wav...@gmail.com>
Subject Re: hbase evaluation questions
Date Wed, 14 Jul 2010 10:56:15 GMT
We are a SaaS provider and we want to move to a more shared model (vs. one
MySQL server per client), but we have concerns about going to a completely
mixed/shared model where everything is commingled. It is a psychological
leap we are not really ready to make, and as a paid SaaS business we need
to retain some level of separation between clients' data contractually
(separate backups, etc.).

As for tall vs. wide, I see how tall can be beneficial and will work the
best. To me that means HBase is not really a column-based data store, as
there is no way to efficiently access the millions of columns across its
billions of rows.


On Wed, Jul 14, 2010 at 12:32 PM, Angus He <angushe@gmail.com> wrote:

> > 1) How can HBase be configured for a multi-tenancy model? What are the
> > options to create a solid separation of data? In a relational database,
> > schemas would provide this, and in Cassandra the keyspace can provide the
> > same. Of course we can add the tenancy key to the row key and create
> > tenant-specific tables/column families, but that does not provide the
> > same level of confidence of separation. We could also create separate
> > clusters for each client, but then that defeats part of the point of
> > going to a distributed database cluster to improve overall
> > throughput and utilization across all clients. We currently run a single
> > MySQL database for each of our clients (1-3 TB each).
> I was just wondering why you need to separate the analytical data into
> different tables or HBase instances.
> Data reliability or security?
> By the way, in the Bigtable paper, Google mentioned that they packed
> data for all web sites into two tables: a raw click table for the
> end-user sessions, and a summary table for summary data.
> We did the same, and it has worked all right so far.
> > 2) I am trying to model data within HBase, and I am unable to truly
> > model it as a column-based data store due to the limitations of the API
> > (hbase.thrift) in terms of getting back data for certain columns. I see
> > information on defining a bloom filter, which I believe could help speed
> > up the retrieval of certain columns within a large row, but the API does
> > not seem to offer the ability to iterate through the columns. The API
> > supports requesting a list of columns, but I have seen no way to scan
> > the columns of a given row key based on a start/stop column. This forces
> > us to create a tall data model vs. a wide data model, which in the end
> > we think will hurt performance, as more rows will be required.
> HBase has no built-in support for column range scans,
> but you can roll your own implementation on top of the
> versatile HBase filter mechanism.
> You probably do not need column range scan support at all.
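A hand-rolled column-range filter boils down to comparing each qualifier in a row (qualifiers are stored sorted within a row) against the range bounds and including only those inside. The following is a minimal Python simulation of that inclusion logic, not actual HBase client or filter code; the qualifier names are made up for illustration:

```python
from bisect import bisect_left

def column_range(qualifiers, start_col, stop_col):
    """Return the qualifiers in [start_col, stop_col) from a row,
    as a hand-rolled column-range filter would include them.
    Qualifiers are compared lexicographically, like HBase byte keys."""
    qs = sorted(qualifiers)  # within a row, HBase keeps qualifiers sorted
    return qs[bisect_left(qs, start_col):bisect_left(qs, stop_col)]

row = [f"date-{i:04d}" for i in range(1, 21)]
print(column_range(row, "date-0005", "date-0010"))
# ['date-0005', 'date-0006', 'date-0007', 'date-0008', 'date-0009']
```

A real filter would make the same comparison per KeyValue on the server side, so only the matching cells cross the wire.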
> In my opinion, a tall table is more efficient:
> 1. A fat table probably needs to process more data to get the same result.
> Tall table:
> row 1: foobar-date-1
> row 2: foobar-date-2
> ...
> row 1000: foobar-date-1000
> Fat table:
> row 1: foobar   columns: date-1, date-2, ..., date-1000
> Assume you want to retrieve the data between date-50 and date-100.
> In the tall-table case, just set the scan start key to foobar-date-50
> and the end key to foobar-date-100; only 50 KeyValue items are touched.
> But for the fat table, you have to skip the first 49 columns, date-1
> through date-49, then stop at column date-100, so 100 KeyValue items
> are involved. (This would no longer hold if HBase supported a seek
> operation some day.)
> 2. Granularity is more flexible when parallel queries are employed.
> > The data model is a standard star schema in relational terms, with a
> > time dimension. Time only goes down to daily granularity, and we would
> > prefer to have it be part of the column key instead of the row key. In
> > all the examples I have seen, time is added to the end of the row key,
> > to be accessed via row scans. In Cassandra, for example, time is modeled
> > as a super column or a composite column index, and the API supports a
> > range get against a set of columns within a single row.
> >
> > Any advice or pointers would be greatly appreciated. Thanks in advance!
> >
> > Wayne
> >
> --
> Regards
> Angus
