hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From stack <st...@duboce.net>
Subject Re: Secondary indexes
Date Tue, 22 Apr 2008 18:58:20 GMT
Some questions interlaced below:

Clint Morgan wrote:
> All,
>
> We want to put secondary indexes into hbase. The motivation is that we
> are storing data in hbase that we want to serve to users. We would
> like to be able to serve rows sorted by column values. Our queries
> will be over rows with a given key prefix, so we should not be hitting
> to many regions.
>   
> I was thinking it would work roughly like this:
>
> - At table creation time, individual columns can be declared as
> indexed. By default we could sort the column values lexicographically,
> or we can provide a WritableComparatorFactory<T> which has the ability
> to make values of type T from a byte [], as well as providing a
> Comparator<T>. (Better than providing a Comparator<byte[]> as it only
> costs once per row insert for deserialization, rather that twice on
> each comparison).
>   

I don't follow what the Factory adds.

We're talking about getting HBASE-82 into 0.2.  Does that interfere with 
this proposal?  (I'm thinking that along w/ rows becoming byte arrays 
rather Text with a client-supplied Comparator, column qualifiers would 
shift to be byte arrays also -- though yeah, implies that if your sort 
is not byte-lexicographical, yes, the compares can be costly involving 
two deserializations per compare).
> - We catch all writes/deletes and maintain a SortedMap<T, HStoreKey>
> which keeps the column values in order, and maps them back to row
> keys. First cut may just keep all this in memory, but it should be
> backed with MapFile(s).
>   

Would be sweet if you could leverage the HBase memcache code and flusher 
to do the above.

This Map would be global for the table?  Or per Region?

A lucene index wouldn't work for you because you want ordering?

> - Add to the hregion the ability to scan through keys in column order.
> Just iterate through the SortedMap, run a filter on the key, and if it
> passes do a get on the row.
>   

You'd be random reading rows.  You're OK w/ current performance?  (For 
sure it will only improve but....).

> - Add a ColumnOrderedClientScanner which will open column order
> scanners to all applicable hregions, and continuously pick row with
> the lowest column value from each of the client scanners.
>   

This scanner would have a significant client-side component to do the 
arbitrage between all regions to figure the lowest column value?  If you 
had a new type of 'region' -- one denoted by lowest and upper column 
then the client-side logic would fade away and your scanner would look 
like current scanners.
> - Region splits should be easy enough, just a scan through the
> SortedMap to partition.
>   

Splits would not be row-based and run as they currently do, but rather 
sorted-column based?

> Of course, the index could also be used for more efficient querying on
> the indexed column's values.
>
> Do other users have a need for this functionality?
>
> What do developers think about this? I know hbase is more intended for
> back-end batch style processing, but we have this need.
>
>   
How are you thinking of adding in this new functionality?  Subclassing 
HRegionServer?

St.Ack

> Cheers,
> -clint
>   


Mime
View raw message