hbase-user mailing list archives

From James Taylor <jtay...@salesforce.com>
Subject Re: Read access pattern
Date Tue, 30 Apr 2013 16:40:54 GMT
bq. The downside that I see is the bucket_number that we have to 
maintain both at time of reading/writing and update it in case of 
cluster restructuring.

I agree that this maintenance can be painful. However, Phoenix 
(https://github.com/forcedotcom/phoenix) now supports salting, which 
automates it. To salt your table, just add a SALT_BUCKETS = <n> 
property at the end of your DDL statement, where <n> is the total 
number of buckets (up to a maximum of 256). For example:

CREATE TABLE t (date_time DATE NOT NULL, event_id CHAR(15) NOT NULL,
     CONSTRAINT pk PRIMARY KEY (date_time, event_id))
     SALT_BUCKETS = 10

This prepends one byte to your row key, whose value is formed by 
hashing the row key and taking it modulo the bucket count (10 here). 
The salt byte is added automatically on every upsert, and queries are 
automatically distributed across the buckets with the results combined 
as expected.
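Conceptually, the salt byte works like this (a minimal sketch; Phoenix's actual hash function is internal to the library, so `Arrays.hashCode` below is only a stand-in, and `SaltSketch` is a hypothetical name):

```java
import java.util.Arrays;

public class SaltSketch {
    static final int SALT_BUCKETS = 10; // matches SALT_BUCKETS = 10 in the DDL above

    // Illustrative only: Phoenix uses its own internal hash, not
    // Arrays.hashCode. The key point is that the bucket is derived from
    // the row key itself, so readers can recompute it with no external
    // bookkeeping.
    static byte saltByte(byte[] rowKey) {
        int h = Arrays.hashCode(rowKey);
        return (byte) Math.floorMod(h, SALT_BUCKETS);
    }

    // The salted key that actually lands in HBase: one salt byte
    // followed by the original row key.
    static byte[] saltedKey(byte[] rowKey) {
        byte[] out = new byte[rowKey.length + 1];
        out[0] = saltByte(rowKey);
        System.arraycopy(rowKey, 0, out, 1, rowKey.length);
        return out;
    }
}
```

Because the salt is a pure function of the key, a point lookup recomputes the bucket, while a full scan fans out one scan per bucket and merges the results.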



On 04/30/2013 09:17 AM, Shahab Yunus wrote:
> Well those are *some* words :) Anyway, can you explain a bit in detail that
> why you feel so strongly about this design/approach? The salting here is
> not the only option mentioned and static hashing can be used as well. Plus
> even in case of salting, wouldn't the distributed scan take care of it? The
> downside that I see is the bucket_number that we have to maintain both at
> time of reading/writing and update it in case of cluster restructuring.
> Thanks,
> Shahab
> On Tue, Apr 30, 2013 at 11:57 AM, Michael Segel
> <michael_segel@hotmail.com>wrote:
>> Geez that's a bad article.
>> Never salt.
>> And yes there's a difference between using a salt and using the first 2-4
>> bytes from your MD5 hash.
>> (Hint: Salts are random. Your hash isn't. )
>> Sorry to be-itch but it's a bad idea and it shouldn't be propagated.
>> On Apr 29, 2013, at 10:17 AM, Shahab Yunus <shahab.yunus@gmail.com> wrote:
>>> I think you cannot simply use the scanner to do a range scan here, as your
>>> keys are not monotonically increasing. You need to apply logic to
>>> decode/reverse the mechanism that you used to hash your keys at the
>>> time of writing. You might want to check out the Sematext library, which
>>> does distributed scans and seems to handle the scenarios that you want to
>>> implement.
>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>> On Mon, Apr 29, 2013 at 11:03 AM, <ricla@laposte.net> wrote:
>>>> Hi,
>>>> I have a rowkey defined by :
>>>>         getMD5AsHex(Bytes.toBytes(myObjectId)) + String.format("%19d\n",
>>>> (Long.MAX_VALUE - changeDate.getTime()));
>>>> How could I get the previous and next row for a given rowkey ?
>>>> For instance, I have the following ordered keys :
>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370673172227807
>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674468022807
>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674468862807
>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674984237807
>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674987271807
>>>> If I choose the rowkey :
>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674468862807, what would be the
>>>> correct scan to get the previous and next key ?
>>>> Result would be :
>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674468022807
>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674984237807
>>>> Thank you !
>>>> R.
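For illustration, the neighbor lookup asked for above has the same semantics as `TreeSet.lower`/`TreeSet.higher` over the sorted keys; in HBase itself, the next row is a forward scan starting just past the key, and the previous row needs a reversed scan (HBASE-4811, available only in newer HBase releases). A self-contained sketch using the keys from the question (`NeighborKeys` is a hypothetical name; the `*` is kept as shown):

```java
import java.util.TreeSet;

public class NeighborKeys {
    // The ordered row keys from the question (md5(objectId) + inverted timestamp).
    static final TreeSet<String> KEYS = new TreeSet<>();
    static {
        KEYS.add("00003db1b6c1e7e7d2ece41ff2184f76*9223370673172227807");
        KEYS.add("00003db1b6c1e7e7d2ece41ff2184f76*9223370674468022807");
        KEYS.add("00003db1b6c1e7e7d2ece41ff2184f76*9223370674468862807");
        KEYS.add("00003db1b6c1e7e7d2ece41ff2184f76*9223370674984237807");
        KEYS.add("00003db1b6c1e7e7d2ece41ff2184f76*9223370674987271807");
    }

    // Largest key strictly before 'row'. In HBase: a reversed Scan
    // starting just before 'row', with a limit of 1 row.
    static String prev(String row) { return KEYS.lower(row); }

    // Smallest key strictly after 'row'. In HBase: a forward Scan whose
    // start row is 'row' plus a trailing 0x00 byte, with a limit of 1 row.
    static String next(String row) { return KEYS.higher(row); }
}
```

Applied to the chosen key `...674468862807`, `prev` and `next` return the two keys the question expects.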
