hbase-user mailing list archives

From Juhani Connolly <juh...@ninja.co.jp>
Subject Behaviour of filters within scans
Date Mon, 19 Apr 2010 03:48:22 GMT
I've spent some time looking through the RegionScanner logic, in
particular the filter-related parts, and would like to check a) whether my
current understanding is correct and b) whether this may be subject to change.

Short/simplified version to avoid getting sidetracked:
- A RegionScanner is built from a series of scanners, one attached to each
Store.
- This list of scanners is stored in a KeyValueHeap, which compares
KeyValues to determine the order in which entries are retrieved by
RegionScanner->next
  - To check the order in which keys will be returned, and thus filtered,
one can look at KeyValue.KeyComparator->compare. It's something like:
sort by row, then column family, then column, then timestamp
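
To make the ordering concrete, here is a minimal sketch in plain Java. It does not use the HBase classes: Cell, ORDER and sorted are made-up names for illustration, strings stand in for byte arrays, and the newest-first timestamp order is my understanding of the comparator rather than something stated above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class CellOrderSketch {
    // Hypothetical stand-in for a KeyValue; not the HBase class.
    static final class Cell {
        final String row, family, qualifier;
        final long timestamp;

        Cell(String row, String family, String qualifier, long timestamp) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return row + "/" + family + ":" + qualifier + "@" + timestamp;
        }
    }

    // Row, then column family, then column (qualifier), then timestamp.
    // Assumption: newer timestamps sort first.
    static final Comparator<Cell> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.family.compareTo(b.family);
        if (c != 0) return c;
        c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) return c;
        return Long.compare(b.timestamp, a.timestamp);
    };

    // Returns the cells in the order a scan would hand them to a filter.
    static List<String> sorted(Cell... cells) {
        List<Cell> list = new ArrayList<>(Arrays.asList(cells));
        list.sort(ORDER);
        List<String> out = new ArrayList<>();
        for (Cell cell : list) {
            out.add(cell.toString());
        }
        return out;
    }
}
```

Under this ordering, every col-a cell of a row is seen before any col-b cell of the same row and family.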

Filters are applied as described in
http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/filter/Filter.html

In the end, when using filterKeyValue(KeyValue), one can expect the
KeyValues to be passed to it in sorted order. Will this always be the case?

I ask because I currently plan to filter the values of col-b based
on the values in col-a. This could be achieved by making sure col-a
sorts before col-b and keeping some state (e.g. a list of
"ok" timestamps) within the custom filter. Does this all sound OK?
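
A rough sketch of the stateful filter I have in mind, written as plain Java rather than against the real Filter interface. The class name, the filterCell and reset methods, and the "value equals ok" condition are all placeholders I made up for illustration. It only works if col-a cells arrive before col-b cells within a row.

```java
import java.util.HashSet;
import java.util.Set;

class OrderDependentFilterSketch {
    // Timestamps of col-a cells in the current row that met the condition.
    private final Set<Long> okTimestamps = new HashSet<>();

    // To be called whenever the scan moves on to a new row.
    void reset() {
        okTimestamps.clear();
    }

    // Returns true if the cell should be included in the result.
    // Relies on col-a being delivered before col-b.
    boolean filterCell(String qualifier, long timestamp, String value) {
        if (qualifier.equals("col-a")) {
            // Placeholder condition; the real check is application specific.
            if (value.equals("ok")) {
                okTimestamps.add(timestamp);
            }
            return true; // assumption: col-a cells themselves are kept
        }
        if (qualifier.equals("col-b")) {
            return okTimestamps.contains(timestamp);
        }
        return true; // other columns pass through untouched
    }
}
```

If the cells ever arrived in a different order, the col-b check would run against an empty set and drop everything, which is why the ordering guarantee matters here.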

Finally, it would be nice to see an option to filter a full row's set of
results, as naming columns to guarantee a certain sort order for filters
seems pretty dubious:
- Probably in HRegion.RegionScanner->next, after nextInternal and before
filterRow?
- This would allow a filter to go through the gathered results
and prune them depending on inter-column dependencies.
- I believe it would unlock a lot of possibilities for custom filters
that could cut down significantly on transfers, since a row's data
could be pruned on the regionserver side rather than at the client. My
particular application is to keep col-b only where there is a col-a
with a corresponding timestamp that matches specific conditions. In my
particular case this results in massive reductions in the number of
cells being sent from the regionserver.
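
As an illustration of what such a hook could do, here is a minimal sketch in plain Java. Nothing here is an existing HBase API: pruneRow, the Cell class and the "value equals ok" condition are hypothetical. Because it sees the whole row at once, it does not depend on how the columns sort.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RowPruneSketch {
    // Hypothetical stand-in for a KeyValue within a single row.
    static final class Cell {
        final String qualifier;
        final long timestamp;
        final String value;

        Cell(String qualifier, long timestamp, String value) {
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Takes all gathered cells of one row and returns the ones to send back.
    static List<Cell> pruneRow(List<Cell> row) {
        // First pass: collect timestamps of col-a cells that meet the condition.
        Set<Long> ok = new HashSet<>();
        for (Cell c : row) {
            if (c.qualifier.equals("col-a") && c.value.equals("ok")) {
                ok.add(c.timestamp);
            }
        }
        // Second pass: drop col-b cells without a matching col-a timestamp.
        List<Cell> kept = new ArrayList<>();
        for (Cell c : row) {
            if (c.qualifier.equals("col-b") && !ok.contains(c.timestamp)) {
                continue;
            }
            kept.add(c);
        }
        return kept;
    }
}
```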

Any thoughts would be appreciated.

As an aside, I believe HRegion.RegionScanner->nextInternal calls
filterRowKey for every key in a row, even if the row has already passed
once. Is this intentional behaviour (it seems somewhat unexpected)?
Otherwise it could be optimised by just checking the samerow variable.
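
Roughly what I mean, as a plain Java sketch and not the actual nextInternal code: countChecks and its parameters are made-up names, and the input is simply the row key of each cell in scan order. The row-key predicate is only evaluated when the row changes.

```java
import java.util.List;
import java.util.function.Predicate;

class RowKeyCheckSketch {
    // Returns how many times the row-key predicate is evaluated when the
    // decision is reused for consecutive cells of the same row.
    static int countChecks(List<String> rowKeyPerCell, Predicate<String> rowKeyFilter) {
        int checks = 0;
        String currentRow = null;
        boolean currentRowFiltered = false;
        for (String row : rowKeyPerCell) {
            boolean sameRow = row.equals(currentRow);
            if (!sameRow) {
                // New row: evaluate the row-key filter once and remember the result.
                currentRow = row;
                currentRowFiltered = rowKeyFilter.test(row);
                checks++;
            }
            if (currentRowFiltered) {
                continue; // cell belongs to a filtered row, skip it
            }
            // ... per-cell filtering would happen here ...
        }
        return checks;
    }
}
```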
