hbase-user mailing list archives

From Software Dev <static.void....@gmail.com>
Subject Re: Questions on FuzzyRowFilter
Date Sat, 03 May 2014 15:52:43 GMT
Edit: I should have mentioned that my access pattern is a bit
different. I'll need to scan between dates (20140101 -> 20140501), not
an individual date. My table is actually a bunch of increments, so as
of right now there is only 1 row key per timeframe.
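
A rough sketch of how the bucketed salting suggested in the quoted reply below could serve a date-range scan. The table here is only an in-memory sorted list standing in for HBase, and the key layout (`<salt>-<date>-<login>`), the function names, and the bucket count of 10 are assumptions made for illustration, not an HBase API:

```python
import heapq
from bisect import bisect_left

NUM_BUCKETS = 10  # assumed: the "10 spans" from the quoted reply


def salted_key(date, login, salt):
    """Build a hypothetical row key of the form '<salt>-<date>-<login>'."""
    return f"{salt:02d}-{date}-{login}"


def scan(table, start, stop):
    """Range scan over a sorted list of (key, value) pairs.

    start is inclusive and stop is exclusive.
    """
    keys = [k for k, _ in table]
    return table[bisect_left(keys, start):bisect_left(keys, stop)]


def scan_date_range(table, start_date, stop_date):
    """Run one range scan per salt bucket, then merge the results by date."""
    per_bucket = []
    for salt in range(NUM_BUCKETS):
        rows = scan(table,
                    f"{salt:02d}-{start_date}",
                    f"{salt:02d}-{stop_date}")
        # Drop the 3-character salt prefix so rows order by date again.
        per_bucket.append([(k[3:], v) for k, v in rows])
    # Each bucket is already sorted, so a k-way merge is enough.
    return list(heapq.merge(*per_bucket, key=lambda row: row[0]))


if __name__ == "__main__":
    table = sorted([
        (salted_key("20131231", "al", 3), 1),
        (salted_key("20140101", "bob", 7), 2),
        (salted_key("20140315", "al", 0), 3),
        (salted_key("20140501", "dd", 1), 4),
    ])
    for key, value in scan_date_range(table, "20140101", "20140501"):
        print(key, value)
```

In a real client the ten per-bucket scans would be issued in parallel; here they run one after another to keep the sketch short.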

On Sat, May 3, 2014 at 8:39 AM, Software Dev <static.void.dev@gmail.com> wrote:
> OK, so there is no way around the FuzzyRowFilter checking every single
> row in the table, correct? If so, what is a valid use case for that
> filter?
>
> OK, so salt with a small enough prefix range that scanning stays reasonable. Our
> client for accessing these tables is a Rails (not JRuby) application
> so we are stuck with either the Thrift or REST client. Can either of
> these perform multiple gets/scans?
>
>
>
> On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet <adrien.mogenet@gmail.com> wrote:
>> Using 4 random bytes you'll get 2^32 possible prefixes; your data will be
>> spread well across all the possible regions, but you won't be able to
>> easily benefit from distributed scans to gather what you want.
>>
>> Let's say you want to split (time+login) with a salted key and you expect to
>> be able to retrieve events from 20140429 pretty fast. Then I would split
>> input data among 10 "spans", spread over 10 regions and 10 RS (i.e. `$random
>> % 10`). To retrieve ordered data, I would parallelize Scans over the 10
>> span groups (<00>-20140429, <01>-20140429...) and merge-sort everything
>> until I've got all the expected results.
>>
>> So in terms of performance this looks "a little bit" faster than your 2^32
>> randomization.
>>
>>
>> On Fri, May 2, 2014 at 10:09 PM, Software Dev <static.void.dev@gmail.com> wrote:
>>
>>> I'm planning to work with FuzzyRowFilter to avoid hot spotting of our
>>> time series data (20140501, 20140502...). We can prefix all of the
>>> keys with 4 random bytes and then just skip these during scanning. Is
>>> that correct? This *seems* like it will work but I'm questioning the
>>> performance of this even if it does work.
>>>
>>> Also, is this available via the rest client, shell and/or thrift client?
>>>
>>> Also, is there a FuzzyColumn equivalent of this feature?
>>>
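
For reference, a minimal sketch of the matching idea behind the question above: a pattern plus a per-byte mask, where the 4 salt bytes are "don't care" and the date bytes must match exactly. The mask convention (0 = fixed, 1 = any byte) is assumed to mirror FuzzyRowFilter's; the function name and the handling of short rows are hypothetical choices for this sketch, and it does not model how HBase actually skips rows:

```python
def fuzzy_match(row, pattern, mask):
    """Return True if row matches pattern at every fixed position.

    mask[i] == 0: row[i] must equal pattern[i]
    mask[i] == 1: row[i] may be any byte
    Rows shorter than the pattern are rejected (an assumption of this sketch).
    """
    if len(row) < len(pattern):
        return False
    return all(m == 1 or r == p for r, p, m in zip(row, pattern, mask))


# 4 wildcard salt bytes followed by a fixed 8-byte date.
PATTERN = b"\x00\x00\x00\x00" + b"20140501"
MASK = bytes([1, 1, 1, 1] + [0] * 8)

if __name__ == "__main__":
    print(fuzzy_match(b"\xde\xad\xbe\xef" + b"20140501", PATTERN, MASK))
    print(fuzzy_match(b"\x01\x02\x03\x04" + b"20140502", PATTERN, MASK))
```

Note that a pattern like this pins a single date; covering a range of dates would take several patterns or wildcarded date digits.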
>>
>>
>>
>> --
>> Adrien Mogenet
>> http://www.borntosegfault.com
