hbase-user mailing list archives

From "Goel, Ankur" <Ankur.G...@corp.aol.com>
Subject RE: HBase performance tuning
Date Thu, 27 Mar 2008 11:22:49 GMT
Thanks for the explanation, Stack. Using my threaded client
I got a throughput of 6000 inserts/sec. Let me use and modify
the code you posted on the wiki to see if I can get better
throughput. I'll write to the list again once I have some performance data.


-----Original Message-----
From: stack [mailto:stack@duboce.net] 
Sent: Wednesday, March 26, 2008 9:42 PM
To: hbase-user@hadoop.apache.org
Subject: Re: HBase performance tuning

Goel, Ankur wrote:
> ...
> One technique that I can think of is to create an HTable pool 
> (Apache's Object Pool Framework can be used) and set it in the Map-Red
> job configuration and set the pool Size to sufficiently large number 
> (~300 to 400). This way a mapper does not need to bother about 
> creation of HTable objects.

Each mapper runs in a new JVM instance.  There is no context shared by 
mappers into which you could put your pool instance.

Ideally, you'd run an HBase client per RegionServer, or better, a client
per Region.  The latter is hard to do because tables during bulk uploads
are in a state of flux, with the number of regions changing frequently.

A threaded client like yours could do as the TableInputFormat does under
the HBase mapred package, querying first to find the list of Regions, 
and you could put up that many clients.  As the upload progressed, 
you'd periodically re-ask for the number of regions and adjust the number
of clients accordingly.
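
A minimal sketch of that resize-as-you-go idea, written in plain Python
rather than against the HBase Java client. Everything named here is
hypothetical: insert_fn stands in for whatever writes one row to the
table, and count_regions stands in for the query that returns the current
number of regions. Each worker thread plays the role of one long-lived
client.

```python
import queue
import threading


class UploaderPool:
    """Pool of long-lived uploader threads, resized to track the region count."""

    def __init__(self, insert_fn):
        self.insert_fn = insert_fn   # hypothetical: writes one row to the table
        self.rows = queue.Queue()    # rows waiting to be inserted
        self.workers = []            # list of (thread, stop_event) pairs

    def _run(self, stop):
        # One worker stands in for one client; it finishes its current row
        # and exits once its stop event is set.
        while not stop.is_set():
            try:
                row = self.rows.get(timeout=0.05)
            except queue.Empty:
                continue
            try:
                self.insert_fn(row)
            finally:
                self.rows.task_done()

    def resize(self, n):
        # Grow the pool up to n workers.
        while len(self.workers) < n:
            stop = threading.Event()
            thread = threading.Thread(target=self._run, args=(stop,), daemon=True)
            thread.start()
            self.workers.append((thread, stop))
        # Shrink the pool down to n workers.
        while len(self.workers) > n:
            thread, stop = self.workers.pop()
            stop.set()
            thread.join()

    def submit(self, row):
        self.rows.put(row)

    def close(self):
        # Wait for queued rows to drain (needs at least one live worker),
        # then stop every worker.
        self.rows.join()
        self.resize(0)


def rebalance(pool, count_regions):
    # Call this on a timer during the upload; count_regions is a
    # hypothetical callable returning the current number of regions.
    pool.resize(count_regions())
```

In use, the uploader would call rebalance(pool, count_regions) every so
often while feeding rows through pool.submit(row), and pool.close() at
the end.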

You also want clients to be somewhat long-lived so that you're not 
fetching region locations every time you want to do an insert; rather, 
the client uses its region-cache.  In your threaded uploader, this isn't
hard to do.  But in a MR job with a new JVM created to run every task, 
one suggestion would be to do the insert in the reduce step (See the 
TableReduce under mapred package).  Set the number of reducers to the 
number of RegionServers or an estimate of the number of Regions (Or run 
multiple jobs gradually stepping up the number of reducers).  The map 
would sort the input so commits would be going in serially.
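
To make the shape of that concrete, here is a small simulation in plain
Python, not real MapReduce or HBase code. It assumes a simple hash
partitioner and a hypothetical insert_fn callback; the point is only that
each reducer ends up with its own slice of the rows, sorted by key, and
commits them one after another.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # Stable hash so a given key always lands on the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(rows, num_reducers):
    # Group (key, value) rows by reducer, then sort each group by key,
    # mimicking the sort that happens between map and reduce.
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[partition(key, num_reducers)].append((key, value))
    return {r: sorted(b, key=lambda kv: kv[0]) for r, b in buckets.items()}


def run_reducers(rows, num_reducers, insert_fn):
    # Each reducer commits its rows serially, in key order.
    # insert_fn is a hypothetical stand-in for the table write.
    for reducer_id, sorted_rows in sorted(shuffle(rows, num_reducers).items()):
        for key, value in sorted_rows:
            insert_fn(reducer_id, key, value)
```

Here num_reducers would be set to the number of RegionServers or an
estimate of the number of Regions, as described above.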

Let me put up sample code that does the latter in a little while.

Bulk upload is an interesting problem.  I suggest MR as a 
quick-and-dirty means of putting up many clients and as a direction that
will likely scale, but it lacks finesse.

