hbase-user mailing list archives

From "Goel, Ankur" <Ankur.G...@corp.aol.com>
Subject RE: HBase performance tuning
Date Wed, 26 Mar 2008 10:15:10 GMT
I just finished implementing and testing the multithreaded version
of my Java program that does the inserts, and I was able to get a
system throughput of 6000 inserts/sec with 200 threads doing I/O.

What's interesting here is that initially there was no appreciable
performance gain from multithreading when all the threads shared the
same HTable object. But when each thread had its own instance of
HTable, the performance really rocked! (4X improvement in throughput)
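
For what it's worth, each worker thread in my test looks roughly like the
sketch below. Treat it only as an illustration, not the exact code I ran:
the imports and the HTable constructor follow the old Text-based API and may
need adjusting for your HBase version, the table and column names are
placeholders, and md5() / getBytes() are stand-ins for the helpers used in
my snippet quoted at the bottom of this thread.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.io.Text;

public class InsertWorker implements Runnable {

  private final List<String> urls;   // this thread's share of the seed file

  public InsertWorker(List<String> urls) {
    this.urls = urls;
  }

  public void run() {
    try {
      // Each thread creates its OWN HTable instance; sharing one instance
      // across all 200 threads gave no appreciable speedup.
      // NOTE: constructor and package names follow the old Text-based API
      // and the table name is a placeholder; adjust for your HBase version.
      HTable seedlist = new HTable(new HBaseConfiguration(), new Text("seedlist"));
      for (String url : urls) {
        BatchUpdate update = new BatchUpdate(new Text(md5(url)));
        update.put(new Text("url:"), getBytes(url));
        seedlist.commit(update);
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

  // Stand-in for the md5() helper referenced in the snippet below.
  private static String md5(String s) {
    try {
      byte[] digest = java.security.MessageDigest.getInstance("MD5")
          .digest(s.getBytes());
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (java.security.NoSuchAlgorithmException e) {
      throw new RuntimeException(e);   // MD5 is always available in the JDK
    }
  }

  // Stand-in for the getBytes() helper referenced in the snippet below.
  private static byte[] getBytes(String s) {
    return s.getBytes();
  }
}

With 200 of these running over disjoint slices of the seed file, this is the
setup that gave the ~6000 inserts/sec figure above.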

So, if each Mapper creates a new HTable instance then 5000 Maps = 5000
connections, but the connection to the HMaster is short-lived, i.e. once
the region server information is retrieved it is cached on the client
side (in the HTable object), and for the actual scan and update
operations the relevant region servers are contacted directly. And yes,
this could become an overhead with too many mappers.

One technique that I can think of is to create an HTable pool (Apache's
Object Pool framework can be used), set it up in the Map-Red job
configuration, and set the pool size to a sufficiently large number
(~300 to 400). This way a mapper does not need to bother about creating
HTable objects.
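
Something along these lines, for example (just a sketch, not tested: the
class and method names SimpleHTablePool, borrow() and giveBack() are made up
for illustration, the table name is a placeholder, and the HTable constructor
follows the old Text-based API, so adjust for your version; the pooling
classes are from Commons Pool 1.x):

import org.apache.commons.pool.BasePoolableObjectFactory;
import org.apache.commons.pool.impl.GenericObjectPool;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTable;
import org.apache.hadoop.io.Text;

public class SimpleHTablePool {

  // Commons Pool factory that creates a fresh HTable whenever the pool needs one.
  private static class HTableFactory extends BasePoolableObjectFactory {
    public Object makeObject() throws Exception {
      // Placeholder table name and old-style constructor; adjust for your setup.
      return new HTable(new HBaseConfiguration(), new Text("seedlist"));
    }
  }

  private final GenericObjectPool pool;

  public SimpleHTablePool(int maxActive) {     // e.g. 300 to 400
    pool = new GenericObjectPool(new HTableFactory(), maxActive);
  }

  public HTable borrow() throws Exception {
    return (HTable) pool.borrowObject();
  }

  public void giveBack(HTable table) throws Exception {
    pool.returnObject(table);
  }
}

A map task would then borrow() a table, commit its BatchUpdates against it,
and giveBack() the table when it is done, instead of constructing an HTable
of its own.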

-Ankur

-----Original Message-----
From: Andy Li [mailto:annndy.lee@gmail.com] 
Sent: Wednesday, March 26, 2008 12:30 PM
To: hbase-user@hadoop.apache.org
Subject: Re: HBase performance tuning

I have a sample that runs MR where each Map or Reduce task talks to
HBase via the HTable class.

But before I put that online on the Wiki page, I need to confirm
something that may become a performance issue as the cluster grows.
Basically, the problem is this: if I have 5000 Maps running and each of
them creates an HTable instance that applies BatchUpdates, will that
create 5000 connections to the HBase master?  I have only done this at
a smaller scale (100 Mappers) and I don't see any problem, but it will
take profiling and some instrumentation of the system and code to
figure out.  It would be better to fork a new topic for this one.

-Andy

On Tue, Mar 25, 2008 at 10:26 PM, Goel, Ankur <Ankur.Goel@corp.aol.com>
wrote:

> A sample would definitely be good. Even better if we could put it on
> the wiki for everyone else. If you don't have enough spare cycles then
> do let me know and I shall write the sample and put it up on the wiki.
>
> Thanks
> -Ankur
>
>
> -----Original Message-----
> From: stack [mailto:stack@duboce.net]
> Sent: Tuesday, March 25, 2008 7:24 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase performance tuning
>
> Your insert is single-threaded?  At a minimum your program should be
> multithreaded.  Randomize the keys on your data so that the inserts
> are spread across your 9 regionservers.  Better if you spend a bit of
> time and write a mapreduce job to do the insert (if you want a sample,
> write the list again and I'll put something together).
> St.Ack
>
> ANKUR GOEL wrote:
> > Hi Folks,
> > I have a table with the following column families in the schema
> > (the integer in each entry is the max length):
> >        {"referer_id:", "100"},
> >        {"url:", "1500"},
> >        {"site:", "500"},
> >        {"status:", "100"}
> >
> > The common attributes for all the above column families are [max
> > versions: 1,  compression: NONE, in memory: false, block cache
> > enabled: true, max length: 100, bloom filter: none]
> >
> > [HBase Configuration]:
> >   - HDFS runs on 10 machine nodes with 8 GB RAM and 4 CPU cores each.
> >   - HMaster runs on a different machine than the NameNode.
> >   - There are 9 regionservers configured.
> >   - Total DFS available = 150 GB.
> >   - LAN speed is 100 Mbps.
> >
> > I am trying to insert approx 4.8 million rows and the speed that I
> > get is around 1500 row inserts per sec (100,000 row inserts per min.).
> >
> > It takes around 50 min to insert all the seeds. The Java program
> > that does the inserts uses buffered I/O to read the data from a
> > local file and runs on the same machine as the HMaster. To give you
> > an idea of the Java code that does the insert, here is a snapshot of
> > the loop:
> >
> > while ((url = seedReader.readLine()) != null) {
> >      try {
> >        BatchUpdate update = new BatchUpdate(new Text(md5(normalizedUrl)));
> >        update.put(new Text("url:"), getBytes(url));
> >        update.put(new Text("site:"), getBytes(new URL(url).getHost()));
> >        update.put(new Text("status:"), getBytes(status));
> >        seedlist.commit(update); // seedlist is the HTable
> >       }
> > ....
> > ....
> >
> > Is there a way to tune HBase to achieve better I/O speeds?
> > Ideally I would like to reduce the total insert time to less than 15
> > min, i.e. achieve an insert speed of around 4500 rows/sec or more.
> >
> > Thanks
> > -Ankur
> >
> >
>
>
