hbase-user mailing list archives

From "Goel, Ankur" <Ankur.G...@corp.aol.com>
Subject RE: HBase performance tuning
Date Thu, 27 Mar 2008 15:08:23 GMT
I am indeed crawling the web, but only the sites
that are present in my seed list. The crawler used
here is Heritrix 2.0 -
http://webteam.archive.org/confluence/display/Heritrix/2.0.0.

I developed a Heritrix-specific HBase writer that can be integrated with
Heritrix to write the crawled content directly into HBase.
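
A minimal sketch of the write path such a writer might use, assuming the
later HBase HTable/Put client API (the 0.1-era API differed); the class
name and the table/column names are illustrative, and the Heritrix
processor hook that would call it is omitted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseCrawlWriter {
      private final HTable table;

      public HBaseCrawlWriter(String tableName) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Keep the handle long-lived so the client's region cache is reused.
        table = new HTable(conf, tableName);
      }

      // Called once per fetched URI with the raw page bytes.
      public void write(String url, byte[] content) throws Exception {
        Put put = new Put(Bytes.toBytes(url));  // row key = URL
        put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"), content);
        table.put(put);
      }

      public void close() throws Exception {
        table.close();
      }
    }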

-Ankur


-----Original Message-----
From: stack [mailto:stack@duboce.net] 
Sent: Thursday, March 27, 2008 8:04 PM
To: hbase-user@hadoop.apache.org
Subject: Re: HBase performance tuning

Looks like you are crawling the web.  What crawler are you using?
Could you write directly into HBase from the crawler?
St.Ack

Goel, Ankur wrote:
> Thanks for the explanation, Stack. Using my threaded client I got a
> throughput of 6000 inserts/sec. Let me use and modify the code you
> posted on the wiki to see if I can get a better throughput.
> I'll write the list again once I have some performance data.
>
> -Ankur
>
> -----Original Message-----
> From: stack [mailto:stack@duboce.net]
> Sent: Wednesday, March 26, 2008 9:42 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase performance tuning
>
> Goel, Ankur wrote:
>> ...
>>
>> One technique that I can think of is to create an HTable pool
>> (Apache's Object Pool Framework can be used), set it in the Map-Red
>> job configuration, and set the pool size to a sufficiently large
>> number (~300 to 400). This way a mapper does not need to bother about
>> the creation of HTable objects.
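
The quoted pool idea maps onto Apache Commons Pool 1.x roughly as in this
sketch; the table name and pool size are placeholders, and (as noted in
the reply below) the pool would not survive across mapper JVMs:

    import org.apache.commons.pool.BasePoolableObjectFactory;
    import org.apache.commons.pool.impl.GenericObjectPool;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    public class HTablePoolSketch {
      public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        // Factory that opens one HTable handle per pooled object.
        GenericObjectPool pool = new GenericObjectPool(
            new BasePoolableObjectFactory() {
              public Object makeObject() throws Exception {
                return new HTable(conf, "crawl");  // placeholder table name
              }
            });
        pool.setMaxActive(400);  // the ~300-400 suggested above

        HTable table = (HTable) pool.borrowObject();
        try {
          // ... do inserts with table ...
        } finally {
          pool.returnObject(table);  // return the handle for reuse
        }
      }
    }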
>
> Each mapper runs in a new JVM instance.  There is no context shared by
> mappers into which you could put your pool instance.
>
> Ideally, you'd run an HBase client per RegionServer, or better, a
> client per Region.  The latter is hard to do because tables during
> bulk uploads are in a state of flux, with the number of regions
> changing frequently.
>
> A threaded client like yours could do as the TableInputFormat does
> under the HBase mapred package, querying first to find the list of
> Regions, and you could put up that many clients.  As the upload
> progressed, you'd re-ask on a period for the number of regions and
> adjust the number of clients accordingly.
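
A sketch of that periodic re-ask, assuming HTable.getStartKeys() returns
one start key per Region; the resize helper is hypothetical:

    // Run this on a timer while the upload is in flight: regions split
    // during bulk loads, so the client count should track the region count.
    HTable table = new HTable(HBaseConfiguration.create(), "crawl");
    int regionCount = table.getStartKeys().length;  // one entry per region
    resizeClientPool(regionCount);                  // hypothetical helper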
>
> You also want clients to be somewhat long-lived so that you're not
> fetching region locations every time you want to do an insert; rather,
> the client uses its region cache.  In your threaded uploader, this
> isn't hard to do.  But in an MR job with a new JVM created to run
> every task, one suggestion would be to do the insert in the reduce
> step (see the TableReduce under the mapred package).  Set the number
> of reducers to the number of RegionServers or an estimate of the
> number of Regions (or run multiple jobs, gradually stepping up the
> number of reducers).  The map would sort the input so commits would
> be going in serially.
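
A hedged sketch of that reduce-side insert, using the old
org.apache.hadoop.mapred API with a plain HTable rather than the actual
TableReduce class; the table and column names are placeholders:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class InsertReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      private HTable table;

      public void configure(JobConf job) {
        try {
          // One long-lived client per reduce task, reusing its region cache.
          table = new HTable(HBaseConfiguration.create(), "crawl");
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }

      public void reduce(Text row, Iterator<Text> values,
          OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        // Keys arrive sorted from the map phase, so commits go in serially.
        while (values.hasNext()) {
          Put put = new Put(Bytes.toBytes(row.toString()));
          put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"),
                  Bytes.toBytes(values.next().toString()));
          table.put(put);
        }
      }

      public void close() throws IOException {
        table.close();
      }
    }

In the driver, the reducer count would follow the advice above, e.g.
job.setNumReduceTasks(numRegionServers).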
>
> Let me put up sample code that does the latter in a little while.
>
> Bulk upload is an interesting problem.  I suggest MR as a
> quick-and-dirty means of putting up many clients and as a direction
> that will likely scale, but it lacks finesse.
>
> St.Ack

