hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: Table design question
Date Wed, 18 Feb 2009 17:36:21 GMT
On Wed, Feb 18, 2009 at 2:24 AM, Jérôme Thièvre INA <jthievre@ina.fr> wrote:

> Hi,
> I set up a cluster of 4 machines running hbase.
> I'm working on a web archiving application that needs random access to
> records, with requests of the form:
> Record record = getClosestRecord(url, requestedDate);
> This method should find the record for the specified url at the date
> *nearest* to the requestedDate. The requested dates have very little chance
> of matching an insertion date.

(wayback machine?)

Currently we can only return records at an explicit date or older, not

> Each record is made of 10 columns, and each insert is of the form:
> insertRecord(url, date, record);
> There are several possible designs for my record table:
> 1. RowKey=url and all columns are labelled with the same date.
> 2. RowKey=url and we use the timestamp and version support of hbase; column
> names are columnFamily names (no label).
> 3. RowKey=url+date, and column names are columnFamily names (no label).

Examples please (I've only had one cup of coffee so far this morning).

> For now, I use design 1. To answer getClosestRecord correctly, this means
> loading an entire columnFamily for the specified row, finding the closest
> date within that columnFamily, and then loading the other columns labelled
> with this closest date.
> I chose this design because I thought I could use
> HTable.getClosestRowBefore(url, columnFamily:requestedDate) to minimize
> column loads, but in fact I need both the closest row before and the closest
> row after to determine which one is at the closest date, so I don't use
> getClosestRowBefore.
> Solution 2 seems to be a good alternative: I could have the same
> functionality with the same process, but the date would be stored once per
> row insert (as the timestamp) instead of once per column.

This seems like a better hbase fit.
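
Whichever layout is used, the "closest before versus closest after" comparison described above comes down to something like the sketch below. It is a minimal illustration in plain Java with no HBase calls: the TreeSet stands in for the set of stored dates for one url, and the class and method names (ClosestDate, closest) are hypothetical.

```java
import java.util.Arrays;
import java.util.TreeSet;

public class ClosestDate {
    /**
     * Returns the stored timestamp nearest to the requested one,
     * or null if nothing is stored. Ties go to the older timestamp.
     */
    static Long closest(TreeSet<Long> stored, long requested) {
        Long before = stored.floor(requested);   // latest timestamp <= requested
        Long after = stored.ceiling(requested);  // earliest timestamp >= requested
        if (before == null) return after;
        if (after == null) return before;
        return (requested - before <= after - requested) ? before : after;
    }

    public static void main(String[] args) {
        // Stand-in for the dates already inserted for a single url.
        TreeSet<Long> versions = new TreeSet<>(Arrays.asList(100L, 200L, 400L));
        System.out.println(closest(versions, 290L)); // 200
        System.out.println(closest(versions, 310L)); // 400
    }
}
```

Only the two neighbours of the requested date are ever compared, so the cost does not depend on loading every date for the url, provided the store can seek to both neighbours.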

> Solution 3. implies only one insert per row key, but increases dramatically
> the number of rows.

Yeah, but you can scan them quickly.  Good for finding date ranges (until we
enrich the API and allow get/scan between date ranges).  You'll probably
have to do as hbase does internally and use a little trick so the newest
insert sorts first rather than last.
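
One common way to get that newest-first ordering with a url+date row key is to store Long.MAX_VALUE minus the timestamp, zero-padded so that lexicographic order matches numeric order. The sketch below is only an illustration under those assumptions: the TreeMap stands in for a table sorted by row key, and the class name, the rowKey helper and the NUL separator are hypothetical, not part of any HBase API.

```java
import java.util.TreeMap;

public class ReversedDateKey {
    // Hypothetical separator; NUL sorts before any character that appears in a url.
    static final String SEP = "\u0000";

    /** Builds a url+date row key whose sort order puts newer dates first. */
    static String rowKey(String url, long timestampMillis) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        // Pad to 19 digits so string comparison agrees with numeric comparison.
        return url + SEP + String.format("%019d", reversed);
    }

    public static void main(String[] args) {
        String url = "http://example.org/";
        // Sorted map standing in for a table ordered by row key.
        TreeMap<String, String> table = new TreeMap<>();
        table.put(rowKey(url, 1000L), "old");
        table.put(rowKey(url, 3000L), "new");

        // A forward scan starting at the url prefix hits the newest record first.
        System.out.println(table.ceilingEntry(url + SEP).getValue()); // new

        // Seeking to (url, 2000) lands on the newest record at or before t=2000.
        System.out.println(table.ceilingEntry(rowKey(url, 2000L)).getValue()); // old
    }
}
```

With this layout a single forward seek gives the closest record at or before the requested date; the closest record after it is the preceding key, so a real lookup would still need to check that both neighbours belong to the same url.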


> What is the best solution to ensure the best random access time?
> Jérôme Thièvre
