hbase-user mailing list archives

From James Taylor <jtay...@salesforce.com>
Subject Re: Questions on FuzzyRowFilter
Date Sun, 18 May 2014 21:17:27 GMT
@Software Dev - if you use Phoenix, queries would leverage our Skip Scan
(which supports a superset of the FuzzyRowFilter perf improvements). Take a
look here:
http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html

Assuming a row key made up of a low cardinality first value (like a byte
representing an enum), followed by a high cardinality second value (like a
date/time value), you'd get a large benefit from the skip scan when you're
only looking at a small sliver of your time range.
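To make the idea concrete, here is a minimal sketch of what a skip scan does over such a composite key. This is illustrative only, not Phoenix's implementation; the key layout (one enum byte followed by a yyyymmdd string) is an assumption for the example.

```python
def skip_scan_ranges(enum_values, start_date, end_date):
    # One (start_key, stop_key) pair per enum value, covering only the
    # requested date sliver instead of the whole table. A skip scan
    # jumps between these narrow slices rather than reading every row.
    ranges = []
    for v in sorted(enum_values):
        prefix = bytes([v])
        ranges.append((prefix + start_date.encode(),
                       prefix + end_date.encode()))
    return ranges

# A one-day query across 3 enum values touches 3 narrow key slices:
ranges = skip_scan_ranges([0, 1, 2], "20140429", "20140430")
```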

Another option would be to create a secondary index over your date:
http://phoenix.incubator.apache.org/secondary_indexing.html

Thanks,
James


On Sun, May 18, 2014 at 1:56 PM, James Taylor <jtaylor@salesforce.com> wrote:

> The top two hits when you Google for HBase salt are
> - Sematext blog describing "salting" as I described it in my email
> - Phoenix blog again describing "salting" in this same way
> I really don't understand what you're arguing about - the mechanism that
> you're advocating for is exactly the way both these solutions have
> implemented it. I believe we're all in agreement. It seems that you just
> aren't happy with the fact that we've called this technique "salting".
>
>
> On Sun, May 18, 2014 at 11:32 AM, Michael Segel
> <michael_segel@hotmail.com> wrote:
>
>> @James…
>> You’re not listening. There is a special meaning when you say salt.
>>
>> On May 18, 2014, at 7:16 PM, James Taylor <jtaylor@salesforce.com> wrote:
>>
>> > @Mike,
>> >
>> > The biggest problem is you're not listening. Please actually read my
>> > response (and you'll understand that what we're calling "salting" is
>> > not a random seed).
>> >
>> > Phoenix already has secondary indexes in two flavors: one optimized for
>> > write-once data and one more general for fully mutable data. Soon we'll
>> > have a third for local indexing.
>> >
>> > James
>> >
>> >
>> > On Sun, May 18, 2014 at 10:27 AM, Michael Segel
>> > <michael_segel@hotmail.com> wrote:
>> >
>> >> @James,
>> >>
>> >> I know and that’s the biggest problem.
>> >> Salts by definition are random seeds.
>> >>
>> >> Now I have two new phrases.
>> >>
>> >> 1) We want to remain on a sodium free diet.
>> >> 2) Learn to kick the bucket.
>> >>
>> >> When you have data that is coming in on a time series, is the data
>> >> mutable or not?
>> >>
>> >> A better approach would be to redesign a second type of storage to
>> >> handle serial data and how the regions are split and managed.
>> >> Or just not use HBase to store the underlying data in the first
>> >> place and just store the index… ;-)
>> >> (Yes, I thought about this too.)
>> >>
>> >> -Mike
>> >>
>> >> On May 16, 2014, at 7:50 PM, James Taylor
>> >> <jtaylor@salesforce.com> wrote:
>> >>
>> >>> Hi Mike,
>> >>> I agree with you - the way you've outlined is exactly the way
>> >>> Phoenix has implemented it. It's a bit of a problem with
>> >>> terminology, though. We call it salting:
>> >>> http://phoenix.incubator.apache.org/salted.html. We hash the key,
>> >>> mod the hash with the SALT_BUCKETS value you provide, and prepend
>> >>> the row key with this single byte value. Maybe you can coin a good
>> >>> term for this technique?
>> >>>
>> >>> FWIW, you don't lose the ability to do a range scan when you salt
>> >>> (or hash-the-key and mod by the number of "buckets"), but you do
>> >>> need to run a scan for each possible value of your salt byte
>> >>> (0 to SALT_BUCKETS - 1). Then the client does a merge sort among
>> >>> these scans. It performs well.
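The scheme described above (hash the key, mod by the bucket count, prepend one byte, then merge-sort one scan per bucket) can be sketched in a few lines of Python. This is an illustrative sketch of the technique, not Phoenix's code; the bucket count and the in-memory "table" are stand-ins.

```python
import hashlib
import heapq

SALT_BUCKETS = 8  # illustrative; Phoenix takes this as a table option

def salt_byte(row_key: bytes) -> bytes:
    # Deterministic "salt": hash the key and mod by the bucket count.
    # The same key always maps to the same bucket, so get() still works.
    h = int.from_bytes(hashlib.md5(row_key).digest(), "big")
    return bytes([h % SALT_BUCKETS])

def salted_key(row_key: bytes) -> bytes:
    return salt_byte(row_key) + row_key

def merge_scan(per_bucket_results):
    # Client-side merge sort across the per-bucket scan results.
    return list(heapq.merge(*per_bucket_results))

# Simulated table: rows sort by salt byte first, so a range scan needs
# one scan per bucket (0 .. SALT_BUCKETS-1), merged on the client.
keys = [f"201405{d:02d}".encode() for d in range(1, 6)]
table = sorted(salted_key(k) for k in keys)
per_bucket = [[k[1:] for k in table if k[0] == b]
              for b in range(SALT_BUCKETS)]
ordered = merge_scan(per_bucket)  # globally ordered, == sorted(keys)
```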
>> >>>
>> >>> Thanks,
>> >>> James
>> >>>
>> >>>
>> >>> On Fri, May 9, 2014 at 11:57 PM, Michael Segel
>> >>> <michael_segel@hotmail.com> wrote:
>> >>>
>> >>>> 3+ Years on and a bad idea is being propagated again.
>> >>>>
>> >>>> Now repeat after me… DO NOT USE A SALT.
>> >>>>
>> >>>> Having a low sodium diet, especially for HBase, is really good
>> >>>> for your health and sanity.
>> >>>>
>> >>>> The salt is going to be orthogonal to the row key (Key).
>> >>>> There is no relationship to the specific Key.
>> >>>>
>> >>>> Using a salt means you can now randomly spread the distribution
>> >>>> of data to avoid HOT SPOTTING.
>> >>>> However, you lose the ability to seek for a specific row.
>> >>>>
>> >>>> YOU HASH THE KEY.
>> >>>>
>> >>>> The hash, whether you use SHA-1 or MD5, is going to yield the
>> >>>> same result each and every time you provide the key.
>> >>>>
>> >>>> But wait, the generated hash is 160 bits long. We don’t need
>> >>>> that! Absolutely true if you just want to randomize the key to
>> >>>> avoid hot spotting. There’s this concept called truncating the
>> >>>> hash to the desired length.
>> >>>> So to Adrien’s point, you can truncate it to a single byte,
>> >>>> which would be sufficient…
>> >>>> Now when you want to seek for a specific row, you can find it.
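The hash-and-truncate approach above can be sketched as follows. This is a minimal illustration, not any project's actual code; the example key `20140429|user42` is made up.

```python
import hashlib

def prefixed_key(row_key: bytes, n: int = 1) -> bytes:
    # Truncate a cryptographic hash of the key to n bytes and prepend
    # it. Unlike a random salt, the prefix is recomputable from the
    # key alone, so a direct get() on the stored key is still possible.
    return hashlib.sha1(row_key).digest()[:n] + row_key

# Writes scatter across prefixes, but a point lookup can rebuild the
# exact stored key deterministically:
stored = prefixed_key(b"20140429|user42")
lookup = prefixed_key(b"20140429|user42")  # identical to `stored`
```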
>> >>>>
>> >>>> The downside to either solution is that you lose the ability to
>> >>>> do a range scan.
>> >>>> BUT BY USING A HASH AND NOT A SALT, YOU DON’T LOSE THE ABILITY
>> >>>> TO FETCH A SINGLE ROW VIA A get() CALL.
>> >>>>
>> >>>> <rant>
>> >>>> This simple fact has been pointed out several years ago, yet for
>> >>>> some reason, the use of a salt persists.
>> >>>> I’ve actually made that part of the HBase course I wrote, and I
>> >>>> use it in my presentation(s) on HBase.
>> >>>>
>> >>>> It amazes me that the committers and regulars who post here still
>> >>>> don’t grok the fact that if you’re going to ‘SALT’ a row, you
>> >>>> might as well not use HBase and stick with Hive.
>> >>>> I remember Ed C’s rant about how preferential treatment on Hive
>> >>>> patches was given to vendors’ committers… that preferential
>> >>>> treatment seems to also be extended to speakers at conferences.
>> >>>> It wouldn’t be a problem if those said speakers actually knew
>> >>>> the topic… ;-)
>> >>>>
>> >>>> Propagation of bad ideas means that you’re leaving a lot of
>> >>>> performance on the table, and it can kill or cripple projects.
>> >>>> </rant>
>> >>>>
>> >>>> Sorry for the rant…
>> >>>>
>> >>>> -Mike
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On May 3, 2014, at 4:39 PM, Software Dev
>> >>>> <static.void.dev@gmail.com> wrote:
>> >>>>
>> >>>>> Ok so there is no way around the FuzzyRowFilter checking every
>> >>>>> single row in the table, correct? If so, what is a valid use
>> >>>>> case for that filter?
>> >>>>>
>> >>>>> Ok so salt to a low enough prefix that makes scanning
>> >>>>> reasonable. Our client for accessing these tables is a Rails
>> >>>>> (not JRuby) application, so we are stuck with either the Thrift
>> >>>>> or REST client. Can either of these perform multiple gets/scans?
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet
>> >>>>> <adrien.mogenet@gmail.com> wrote:
>> >>>>>> Using 4 random bytes you'll get 2^32 possibilities; thus your
>> >>>>>> data can be split enough among all the possible regions, but
>> >>>>>> you won't be able to easily benefit from distributed scans to
>> >>>>>> gather what you want.
>> >>>>>>
>> >>>>>> Let's say you want to split (time+login) with a salted key and
>> >>>>>> you expect to be able to retrieve events from 20140429 pretty
>> >>>>>> fast. Then I would split input data among 10 "spans", spread
>> >>>>>> over 10 regions and 10 RS (ie: `$random % 10'). To retrieve
>> >>>>>> ordered data, I would parallelize Scans over the 10 span groups
>> >>>>>> (<00>-20140429, <01>-20140429...) and merge-sort everything
>> >>>>>> until I've got all the expected results.
>> >>>>>>
>> >>>>>> So in terms of performance this looks "a little bit" faster
>> >>>>>> than your 2^32 randomization.
>> >>>>>>
>> >>>>>>
>> >>>>>> On Fri, May 2, 2014 at 10:09 PM, Software Dev
>> >>>>>> <static.void.dev@gmail.com> wrote:
>> >>>>>>
>> >>>>>>> I'm planning to work with FuzzyRowFilter to avoid hot
>> >>>>>>> spotting of our time series data (20140501, 20140502...). We
>> >>>>>>> can prefix all of the keys with 4 random bytes and then just
>> >>>>>>> skip these during scanning. Is that correct? This *seems*
>> >>>>>>> like it will work, but I'm questioning the performance of
>> >>>>>>> this even if it does work.
>> >>>>>>>
>> >>>>>>> Also, is this available via the REST client, shell and/or
>> >>>>>>> Thrift client?
>> >>>>>>>
>> >>>>>>> Also, is there a FuzzyColumn equivalent of this feature?
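For readers unfamiliar with the filter being discussed: a toy version of the matching rule a FuzzyRowFilter applies can be written like this. It is a sketch of the concept only (in HBase's API a mask byte of 0 means "fixed" and 1 means "any"), not the real implementation, and the example pattern is made up to match this thread's 4-random-bytes-plus-date keys.

```python
def fuzzy_match(row: bytes, pattern: bytes, mask: bytes) -> bool:
    # Per-byte mask: 0 = byte must equal the pattern, 1 = any byte.
    # With 4 wildcard prefix bytes and a fixed date, rows are matched
    # on the date regardless of their random prefix.
    if len(row) < len(pattern):
        return False
    return all(m == 1 or r == p
               for r, p, m in zip(row, pattern, mask))

# 4 "don't care" prefix bytes, then a fixed 8-byte date:
pattern = b"\x00\x00\x00\x0020140501"
mask = bytes([1] * 4 + [0] * 8)
```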
>> >>>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> Adrien Mogenet
>> >>>>>> http://www.borntosegfault.com
>> >>>>>
>> >>>>
>> >>>>
>> >>
>> >>
>>
>>
>
