hbase-user mailing list archives

From Juhani Connolly <juh...@ninja.co.jp>
Subject Re: Behaviour of filters within scans
Date Mon, 19 Apr 2010 05:30:19 GMT
Thanks for your response.

On 04/19/2010 12:59 PM, Ryan Rawson wrote:
> I think all the functionality is there between these 2 calls:
> Filter#filterKeyValue(KeyValue kv);
> and
> Filter#filterRow();
> In the first call you can cache the KeyValues locally in the filter
> state (in a List<KeyValue>  for example).  In the last call you can do
> your custom logic based on all the KeyValues you have seen.  There is
> little to no cost to do this, since retaining references to a KeyValue
> is cheap (ish, relatively, etc).
But ultimately the only thing I can do with Filter#filterRow() is drop
the full row? Were I to store references to all the KeyValues that have
passed through, at most I could zero out their buffers in the
#filterRow() call. I'm not sure what the consequences of this would be
afterwards, as the scanner would then try to send a load of empty cells.
Looking at HRegionServer#next(final long scannerId, int nbRows), it
seems to me that they would still get packed into a Result and sent back
to the client. I could certainly cut down on a lot of transfer by just
sending "empty" KeyValues, but that still seems like a lot of overhead
that a small API change could eliminate. Or am I missing something here?
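
A minimal sketch of the caching approach discussed above, to make the limitation concrete. It uses hypothetical stand-in types (Cell, CachingFilterSketch) rather than the real org.apache.hadoop.hbase classes, and it simplifies filterKeyValue to return a boolean. The point is only that a boolean filterRow() can use the cached cells to decide whether to drop the row, but has no way to hand back a pruned list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an HBase KeyValue; NOT the real class.
final class Cell {
    final String row;
    final String column;
    final long timestamp;
    final String value;

    Cell(String row, String column, long timestamp, String value) {
        this.row = row;
        this.column = column;
        this.timestamp = timestamp;
        this.value = value;
    }
}

// Stand-in for the two filter hooks named in the thread (assumed shapes).
class CachingFilterSketch {
    private final List<Cell> seen = new ArrayList<>();

    // Called once per cell: remember it and let it through (true = include).
    boolean filterKeyValue(Cell cell) {
        seen.add(cell);
        return true;
    }

    // Called at the end of the row: true means "drop the whole row".
    // The cached cells inform the decision, but it stays all-or-nothing.
    boolean filterRow() {
        for (Cell c : seen) {
            if (c.column.equals("col-a")) {
                return false; // keep the row, including every col-b cell
            }
        }
        return true; // no col-a seen, so drop the entire row
    }

    // Clear the per-row state before the next row starts.
    void reset() {
        seen.clear();
    }
}
```

With this shape, a row holding one matching col-a and many unwanted col-b cells is still sent in full.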

> The filter implementation has changed a bit since August 2009, and it
> might be possible to create a call like
> Filter#filterRow(List<KeyValue>  results) that is called at the "end"
> of a row... you can get the same effect as I noted above.  It is just
> a matter of API, not of semantics.
Having followed the code, it did seem like it would be trivial to
implement such an extra API call either before or after
Filter#filterRow(). I believe the ability to knock KeyValues out of the
list at that point would save on processing later.
I would be happy to put together the minor modification to RegionScanner
and add a unit test, if such a change would be welcome.
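
A rough sketch of what such an end-of-row hook might look like, under stated assumptions: Kv and PruningFilterSketch are hypothetical stand-ins, not HBase classes, and the rule "col-a is acceptable when its value is ok" is invented purely for illustration. The hook receives the gathered cells for one row and removes the col-b entries that have no acceptable col-a at the same timestamp.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for an HBase KeyValue; NOT the real class.
final class Kv {
    final String column;
    final long timestamp;
    final String value;

    Kv(String column, long timestamp, String value) {
        this.column = column;
        this.timestamp = timestamp;
        this.value = value;
    }
}

class PruningFilterSketch {
    // Hypothetical hook: called once per row with all gathered cells,
    // and allowed to remove entries from the list in place.
    static void filterRow(List<Kv> results) {
        // First pass: collect the timestamps of acceptable col-a cells.
        // The "ok" check is an illustrative condition, not from the thread.
        Set<Long> okTimestamps = new HashSet<>();
        for (Kv kv : results) {
            if (kv.column.equals("col-a") && kv.value.equals("ok")) {
                okTimestamps.add(kv.timestamp);
            }
        }
        // Second pass: drop col-b cells without a matching col-a timestamp.
        results.removeIf(kv ->
            kv.column.equals("col-b") && !okTimestamps.contains(kv.timestamp));
    }
}
```

Because the whole row is available at once, this does not rely on col-a sorting ahead of col-b.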

> I would generally discourage you from structuring your data to fit an
> internal implementation detail.  While there are no current plans to
> change sorting order, it would make your code more brittle.
I certainly wouldn't want to do it :) I'm going to have to see how much
overhead I get with a) just dealing with it on the client side or b)
keeping references and zeroing the KeyValues, and go from there.

> -ryan
> On Sun, Apr 18, 2010 at 8:48 PM, Juhani Connolly<juhani@ninja.co.jp>  wrote:
>> I've spent some time looking through the regionscanner logic, in particular
>> the filter related parts and would want to check if a) my current
>> understanding is correct and b) if this may be subject to change.
>> short/simplified version to avoid getting sidetracked:
>> - A RegionScanner is built from a series of scanners attached to each Store.
>> - This list of scanners is stored in a KeyValueHeap which compares KeyValues
>> to sort the order in which entries are retrieved by RegionScanner->next
>>   - To check the order in which keys will be returned, and thus filtered one
>> can look at KeyValue.KeyComparator->compare. It's something like: sort by
>> row, then column family, then column, then timestamp
>> Filters are applied as described in
>> http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/filter/Filter.html
>> In the end, when using filterKeyValue(KeyValue) one can expect the keyValues
>> to be sent to it in a sorted order. Will this always be the case?
>> I ask this because I currently plan to filter the values of col-b based on
>> the values in col-a. This could be achieved by making sure col-a compares
>> lower than col-b and storing some kind of data(e.g. a list of "ok"
>> timestamps) within the custom filter. Does this all sound ok?
>> Finally it would be nice to see the option to filter a full set, as naming
>> columns to guarantee a certain sorting for filters seems pretty dubious:
>> - Probably in HRegion.Regionserver->next after nextInternal, before
>> filterRow?
>> - This would allow a potential filter to go through the gathered results and
>> prune them depending on intercolumn dependencies?
>> - I believe it would unlock a lot of possibilities for custom filters that
>> could cut down on significant amount of transfers where a rows data could be
>> pruned regionserver side rather than at the client. My particular
>> application is to only store col-b where there is a col-a with a
>> corresponding timestamp that matches specific conditions. In my particular
>> case this results in massive reductions in the amount of cells being sent
>> from the regionserver.
>> Any thoughts would be appreciated.
>> As an aside, I believe HRegion.RegionScanner->nextInternal is doing
>> filterRowKey for every key in a row even if it has passed once? Is this
>> intentional behaviour(it seems somewhat unexpected), as otherwise it could
>> be optimised by just checking the samerow variable.
