hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Questions on FuzzyRowFilter
Date Sat, 10 May 2014 06:57:57 GMT
3+ Years on and a bad idea is being propagated again. 

Now repeat after me… DO NOT USE A SALT.

Having a low sodium diet, especially for HBase, is really good for your health and sanity.

The salt is going to be orthogonal to the row key (Key). 
There is no relationship to the specific Key. 

Using a salt means you gain the ability to randomly spread the distribution of data.
However, you lose the ability to seek to a specific row.


The hash, whether you use SHA-1 or MD5, is going to yield the same result each and every time
you provide the key.

But wait, the generated hash is 160 bits long. We don’t need that!
Absolutely true if you just want to randomize the key to avoid hot spotting. There’s this
concept called truncating the hash to the desired length. 
So to Adrien’s point, you can truncate it to a single byte, which would be sufficient…
Now when you want to seek for a specific row, you can find it. 
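
The difference between a salt and a truncated hash can be sketched in a few lines. This is only an illustration under assumptions, not HBase client code: the function names and the sample key are made up for the example, and Python is used simply because it is compact.

```python
import hashlib
import os


def salted_row_key(key: bytes) -> bytes:
    # Random one-byte salt: it has no relationship to the key,
    # so a reader cannot recompute it later.
    return os.urandom(1) + key


def hashed_row_key(key: bytes, prefix_len: int = 1) -> bytes:
    # Truncated SHA-1 of the key itself: the same key always
    # yields the same prefix, so the full row key is reproducible.
    return hashlib.sha1(key).digest()[:prefix_len] + key


key = b"20140501-login42"  # hypothetical key, for illustration only

# Hashed prefix: a reader rebuilds the exact row key and can issue one get().
assert hashed_row_key(key) == hashed_row_key(key)

# Salted prefix: the reader only knows the row is one of 256 candidates.
candidates = [bytes([salt]) + key for salt in range(256)]
assert salted_row_key(key) in candidates
```

With the hashed prefix a single lookup is enough; with the random salt the reader has to try every possible prefix or scan.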

The downside to any solution is that you lose the ability to do a range scan.
With a hash, though, you can still fetch a specific row with a get() CALL.

This simple fact was pointed out several years ago, yet for some reason, the use of a
salt persists.
I’ve actually made that part of the HBase course I wrote and use it in my presentation(s)
on HBase. 

It amazes me that the committers and regulars who post here still don’t grok the fact that
if you’re going to ‘SALT’ a row, you might as well not use HBase and stick with Hive.

I remember Ed C’s rant about how preferential treatment on Hive patches was given to vendors’
committers… that preferential treatment seems to also be extended to speakers at conferences.
It wouldn’t be a problem if those said speakers actually knew the topic… ;-) 

Propagation of bad ideas means that you’re leaving a lot of performance on the table and
it can kill or cripple projects.


Sorry for the rant…


On May 3, 2014, at 4:39 PM, Software Dev <static.void.dev@gmail.com> wrote:

> Ok so there is no way around the FuzzyRowFilter checking every single
> row in the table correct? If so, what is a valid use case for that
> filter?
> Ok so salt to a low enough prefix that makes scanning reasonable. Our
> client for accessing these tables is a Rails (not JRuby) application
> so we are stuck with either the Thrift or Rails client. Can either of
> these perform multiple gets/scans?
> On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet <adrien.mogenet@gmail.com> wrote:
>> Using 4 random bytes you'll get 2^32 possibilities; thus your data can be
>> split enough among all the possible regions, but you won't be able to
>> easily benefit from distributed scans to gather what you want.
>> Let's say you want to split (time+login) with a salted key and you expect to
>> be able to retrieve events from 20140429 pretty fast. Then I would split
>> input data among 10 "spans", spread over 10 regions and 10 RS (ie: `$random
>> % 10'). To retrieve ordered data, I would parallelize Scans over the 10
>> span groups (<00>-20140429, <01>-20140429...) and merge-sort everything
>> until I've got all the expected results.
>> So in terms of performance this looks "a little bit" faster than your 2^32
>> randomization.
>> On Fri, May 2, 2014 at 10:09 PM, Software Dev <static.void.dev@gmail.com> wrote:
>>> I'm planning to work with FuzzyRowFilter to avoid hot spotting of our
>>> time series data (20140501, 20140502...).  We can prefix all of the
>>> keys with 4 random bytes and then just skip these during scanning. Is
>>> that correct? This *seems* like it will work but I'm questioning the
>>> performance of this even if it does work.
>>> Also, is this available via the rest client, shell and/or thrift client?
>>> Also, is there a FuzzyColumn equivalent of this feature?
>> --
>> Adrien Mogenet
>> http://www.borntosegfault.com
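
The ten-span scheme Adrien describes in the quoted message (one scan per span prefix, then a merge-sort of the results) can be sketched roughly as below. This is a sketch under assumptions rather than real client code: `scan_span` is a hypothetical stand-in for one HBase Scan over a sorted table, the `NN-date-login` key layout is assumed from his `<00>-20140429` notation, and in practice the ten scans would run in parallel.

```python
import heapq

NUM_SPANS = 10  # Adrien's "$random % 10"


def span_key(span: int, date: str, login: str) -> str:
    # Assumed key layout: two-digit span, then date, then login.
    return f"{span:02d}-{date}-{login}"


def scan_span(rows, span, date):
    # Hypothetical stand-in for one Scan restricted to a span+date prefix.
    # `rows` is a sorted list of row keys, as a table would return them.
    prefix = f"{span:02d}-{date}"
    return [k for k in rows if k.startswith(prefix)]


def scan_day(rows, date):
    # One scan per span; each result is already sorted within its span.
    per_span = [scan_span(rows, s, date) for s in range(NUM_SPANS)]
    # Drop the "NN-" span prefix so the merge orders by date+login.
    stripped = [[k[3:] for k in part] for part in per_span]
    return list(heapq.merge(*stripped))


rows = sorted([
    span_key(3, "20140429", "bob"),
    span_key(7, "20140429", "alice"),
    span_key(0, "20140428", "carol"),
    span_key(3, "20140429", "dave"),
])
print(scan_day(rows, "20140429"))
```

The cost is one scan per span for every query, which is why a small, fixed number of spans is used here instead of a 4-byte random prefix.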
