hbase-user mailing list archives

From Cosmin Lehene <cleh...@adobe.com>
Subject Re: using date as key
Date Mon, 28 Mar 2011 07:04:02 GMT
Lior, 

If you already know the key distribution you can create all the regions in advance. 
Are you inserting a single day or multiple days?
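
As a rough sketch of what "creating the regions in advance" could look like: take the region boundaries you observed for one day and re-target them at the next day. Everything here is illustrative. The class name, the `yyyyMMdd_key` layout and the `_` separator are my assumptions, not something taken from your setup.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: derives split keys for a new day from the region
// boundaries observed after loading a previous day.
class DaySplitKeys {
    // Assumed row key layout: <date>_<key>. The separator is an assumption.
    static final char SEP = '_';

    // Swap the date prefix of each observed boundary for newDay.
    static List<String> forDay(List<String> observedBoundaries, String newDay) {
        List<String> out = new ArrayList<>();
        for (String boundary : observedBoundaries) {
            int i = boundary.indexOf(SEP);
            if (i < 0) {
                continue; // e.g. the empty start key of the first region
            }
            out.add(newDay + boundary.substring(i));
        }
        return out;
    }

    // Split keys are usually handed to HBase as byte arrays.
    static byte[][] toBytes(List<String> keys) {
        byte[][] result = new byte[keys.size()][];
        for (int i = 0; i < keys.size(); i++) {
            result[i] = keys.get(i).getBytes(StandardCharsets.UTF_8);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> observed = Arrays.asList("20110320_g", "20110320_p", "20110320_w");
        for (String splitKey : forDay(observed, "20110321")) {
            System.out.println(splitKey);
        }
    }
}
```

For a brand-new table, an array like this is the kind of thing you would pass to `HBaseAdmin.createTable(HTableDescriptor, byte[][] splitKeys)`. Adding regions to an existing table, as you did through the meta table, is a different path that the sketch does not cover.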

5X is a good improvement. Here are some more hints:

Hadoop sorts the reduce keys before the actual reduce phase. This means that if your
keys start with the date, all reducers will work through the days in order and insert
for the same day at the same time. If you need to avoid hot regions and the key component
of your date_key is evenly distributed across days, you can emit key_date from the mappers
instead of date_key and then reassemble the keys correctly in the reducers. This way you'll
get an even distribution of inserts across your pre-created regions.
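
A minimal sketch of the swap, assuming keys look like `yyyyMMdd_key`, that `_` is the separator and that the date part never contains it. The class and method names are made up for illustration. In a real job the first method would be called in the mapper and the second in the reducer, just before building the Put.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper for swapping the date and key components.
class KeySwap {
    // Assumed separator between the date and the key component.
    static final char SEP = '_';

    // Mapper side: date_key -> key_date. Splits on the first separator,
    // so the key component itself may contain the separator.
    static String toShuffleKey(String rowKey) {
        int i = rowKey.indexOf(SEP);
        if (i < 0) {
            throw new IllegalArgumentException("no separator in " + rowKey);
        }
        return rowKey.substring(i + 1) + SEP + rowKey.substring(0, i);
    }

    // Reducer side: key_date -> date_key. Splits on the last separator,
    // which relies on the date not containing the separator.
    static String toRowKey(String shuffleKey) {
        int i = shuffleKey.lastIndexOf(SEP);
        if (i < 0) {
            throw new IllegalArgumentException("no separator in " + shuffleKey);
        }
        return shuffleKey.substring(i + 1) + SEP + shuffleKey.substring(0, i);
    }

    public static void main(String[] args) {
        List<String> rowKeys = Arrays.asList(
                "20110327_aaa", "20110327_zzz", "20110328_aaa", "20110328_zzz");
        // A TreeSet stands in for the sort Hadoop performs before the reduce phase.
        TreeSet<String> sorted = new TreeSet<>();
        for (String rowKey : rowKeys) {
            sorted.add(toShuffleKey(rowKey));
        }
        for (String shuffleKey : sorted) {
            // Consecutive inserts now alternate between days instead of
            // finishing one day before starting the next.
            System.out.println(shuffleKey + " -> " + toRowKey(shuffleKey));
        }
    }
}
```

The round trip only gives back the original row key if the separator assumption holds, so it is worth checking against your real keys before relying on it.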

Cosmin



On Mar 27, 2011, at 8:00 PM, Lior Schachter wrote:

> Hi,
> Last week I consulted the forum about HBase insertion optimization when the
> key format is date_key.
> This key format is very good for efficient scans but creates a hotspot on a
> single region when inserting millions of rows.
> 
> I would like to share the solution we found and get feedback on it:
> 1. Insert one day. After the regions split, note the start-end rows on each server
> (this is done once, to see the key distribution).
> 2. Now, before inserting a day, programmatically create empty regions with
> the start-end keys from step 1 (by creating rows in the meta table).
> Assuming the row-key distribution of a day does not change dramatically, the
> reducers can insert into multiple regions (thus avoiding hotspotting).
> 
> Applying this method improved insert performance by a factor of 5 or so.
> 
> Lior

