hbase-user mailing list archives

From Andrew Nguyen <andrew-lists-hb...@ucsfcti.org>
Subject Re: Modeling column families
Date Sat, 24 Apr 2010 20:36:31 GMT
Ryan,

Extremely helpful, and definitely something to think about.  My intuition says the row-oriented
approach is much better for us since there's a (potentially) unbounded amount of data being
fed into the system.

In your eventId example, what was your main reason for not using eventId as a column name?
Is it too large a set?  Or were there other factors affecting your decision?

I'm asking because given your advice so far, I'm considering the following for my key schema:

<patient id><timestamp>

And then having each physiologic parameter be a column.  The set is fairly small: right now
there are about 40-70 parameters, though this may increase.  It also varies from patient to
patient since they are not all hooked up to the same machines.
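
To make that first option concrete, here is a rough sketch of how such a key could be
encoded.  It assumes the patient id is an integer and the timestamp is milliseconds since
the epoch, both packed as fixed-width big-endian values so that the byte-wise order HBase
uses for row keys matches (patient, time) order.  The function names and the sample
parameter names are made up for illustration and are not part of any HBase API.

```python
import struct

def make_row_key(patient_id, timestamp_ms):
    # Two unsigned 64-bit big-endian fields (assumed widths). Big-endian keeps
    # lexicographic byte order identical to numeric order.
    return struct.pack(">QQ", patient_id, timestamp_ms)

def parse_row_key(key):
    # Inverse of make_row_key: returns (patient_id, timestamp_ms).
    return struct.unpack(">QQ", key)

# One row per (patient, instant); each parameter sampled at that instant is a
# column. Patients on different machines simply have different columns.
rows = {
    make_row_key(42, 1272141391000): {"heart_rate": 72, "spo2": 98},
    make_row_key(43, 1272141391000): {"heart_rate": 65},
}
```

With this layout, all samples for one patient sort next to each other in time order.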

The alternative is to go with what you have done with eventId and have the following be my schema:

<patient id><timestamp><signal id>
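
As a sketch of this second option, the key below appends a signal id to the same
fixed-width encoding, so each sample becomes its own row.  The 16-bit signal id, the
single "value" column, and all names here are assumptions for illustration only.

```python
import struct

def make_tall_key(patient_id, timestamp_ms, signal_id):
    # ">QQH": 64-bit patient id, 64-bit timestamp, 16-bit signal id
    # (assumed widths), big-endian with no padding, 18 bytes total.
    return struct.pack(">QQH", patient_id, timestamp_ms, signal_id)

def to_tall_rows(patient_id, timestamp_ms, samples):
    # samples maps signal_id -> value for one instant. Each entry becomes one
    # row holding a single hypothetical "value" column.
    return [
        (make_tall_key(patient_id, timestamp_ms, signal_id), {"value": value})
        for signal_id, value in sorted(samples.items())
    ]

tall = to_tall_rows(42, 1272141391000, {7: 72, 3: 98})
```

Because the signal id comes last in the key, rows for one patient still sort by time
first, and a request for a few signals over a date range would read every signal in
that range and filter afterwards.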

So, I'm trying to figure out what questions I need to ask in order to make the right decisions.
 I definitely think the row-oriented approach has great benefit here, based on what I'm learning
so far, mostly from the scalability standpoint.  One of the other things we're considering
is splitting the cluster across two datacenters (one in San Francisco and one in San Diego)
since there's really no feasible way to back up the amount of data we're anticipating.  I
haven't looked into this much for HDFS either and I'm not sure how this factors into the splitting
for HBase.

In terms of queries, most of our queries would probably be:

All values for a subset of signals for a particular patient in a given date range
All values for a subset of signals for a particular patient
All values for all signals for a particular patient in a given date range
All values for all signals for a particular patient

These would probably be the most common, though people may find new ways to use the data.
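
A rough sketch of how those four queries could map onto key ranges with the
<patient id><timestamp> layout: each one is a contiguous range of row keys (start
inclusive, stop exclusive, as in an HBase scan), with the signal subset handled by
column selection.  The table here is just a sorted in-memory list standing in for HBase,
and every name is hypothetical.

```python
import struct
from bisect import bisect_left

def row_key(patient_id, timestamp_ms):
    # Same assumed encoding as the key schema: two big-endian 64-bit fields.
    return struct.pack(">QQ", patient_id, timestamp_ms)

def scan_bounds(patient_id, start_ms=None, end_ms=None):
    # Returns (start_key, stop_key); start is inclusive, stop is exclusive.
    start = row_key(patient_id, 0 if start_ms is None else start_ms)
    if end_ms is None:
        # No end date: stop at the first possible key of the next patient.
        # (Assumes patient_id is below the 64-bit maximum.)
        stop = row_key(patient_id + 1, 0)
    else:
        stop = row_key(patient_id, end_ms)
    return start, stop

def scan(table, patient_id, start_ms=None, end_ms=None, signals=None):
    # table: list of (row_key, {signal_name: value}) sorted by row_key.
    keys = [key for key, _ in table]
    start, stop = scan_bounds(patient_id, start_ms, end_ms)
    lo, hi = bisect_left(keys, start), bisect_left(keys, stop)
    results = []
    for key, cols in table[lo:hi]:
        if signals is not None:
            # Keep only the requested subset of signals (columns).
            cols = {name: v for name, v in cols.items() if name in signals}
        if cols:
            results.append((key, cols))
    return results

table = [
    (row_key(1, 100), {"hr": 70, "spo2": 98}),
    (row_key(1, 200), {"hr": 71}),
    (row_key(1, 300), {"hr": 72, "spo2": 97}),
    (row_key(2, 150), {"hr": 60}),
]

subset_in_range = scan(table, 1, 100, 300, signals={"spo2"})
subset_all_time = scan(table, 1, signals={"spo2"})
all_in_range = scan(table, 1, 100, 300)
all_for_patient = scan(table, 1)
```

In every case the patient id prefix bounds the range, so none of the four queries has
to touch another patient's rows.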

Thanks!