hbase-user mailing list archives

From "Andy Li" <annndy....@gmail.com>
Subject Re: HBase performance tuning
Date Wed, 26 Mar 2008 06:59:36 GMT
I have a sample that runs a MapReduce job in which each Map or Reduce task
talks to HBase via the HTable class.

But before I put it on the wiki page, I need to confirm something that may
become a performance issue as the cluster grows.  Basically, if I have 5000
Maps running and each of them creates an HTable instance and applies
BatchUpdates, will that create 5000 connections to the HBase master?  I have
only tried it at a smaller scale, with 100 Mappers, and did not see any
problem, but figuring this out will require profiling and some
instrumentation of the system and code.  It would be better to fork a new
topic for this one.
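
To make the question concrete, here is a minimal sketch of the pattern in
question: each map task opens its table handle once and reuses it for every
record, so the number of handles scales with the number of tasks and not with
the number of rows.  TableHandle, CountingTableFactory and SketchMapTask are
made-up stand-ins for illustration, not HBase classes, and the sketch says
nothing about how many connections the master actually sees; that still needs
profiling on a real cluster.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an HBase table handle; NOT the real HTable API.
interface TableHandle {
    void commit(String rowKey, String value);
}

// Hypothetical factory that counts opened handles and commits for the demo.
class CountingTableFactory {
    final AtomicInteger opened = new AtomicInteger();
    final AtomicInteger commits = new AtomicInteger();

    TableHandle open(String tableName) {
        opened.incrementAndGet();
        return (rowKey, value) -> commits.incrementAndGet();
    }
}

// Sketch of a map task that opens its handle lazily and reuses it per record.
class SketchMapTask {
    private final CountingTableFactory factory;
    private TableHandle table; // created on first use, then reused

    SketchMapTask(CountingTableFactory factory) {
        this.factory = factory;
    }

    void map(String rowKey, String value) {
        if (table == null) {
            table = factory.open("seedlist");
        }
        table.commit(rowKey, value);
    }
}

class HandleReuseDemo {
    public static void main(String[] args) {
        CountingTableFactory factory = new CountingTableFactory();
        int tasks = 100;           // same scale as the run mentioned above
        int recordsPerTask = 1000; // arbitrary number for the demo
        for (int t = 0; t < tasks; t++) {
            SketchMapTask task = new SketchMapTask(factory);
            for (int r = 0; r < recordsPerTask; r++) {
                task.map("row-" + t + "-" + r, "v");
            }
        }
        System.out.println("handles opened: " + factory.opened.get());
        System.out.println("commits: " + factory.commits.get());
    }
}
```

With this shape, 5000 tasks would mean 5000 handles regardless of row count;
whether that many handles is a problem for the master is exactly the open
question.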

-Andy

On Tue, Mar 25, 2008 at 10:26 PM, Goel, Ankur <Ankur.Goel@corp.aol.com>
wrote:

> A sample would definitely be good. Even better if we could
> put it on the wiki for everyone else. If you don't have enough
> spare cycles, do let me know and I shall write the sample
> and put it on the wiki.
>
> Thanks
> -Ankur
>
>
> -----Original Message-----
> From: stack [mailto:stack@duboce.net]
> Sent: Tuesday, March 25, 2008 7:24 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase performance tuning
>
> Your insert is single-threaded?  At a minimum your program should be
> multithreaded.  Randomize the keys on your data so that the inserts are
> spread across your 9 regionservers.  Better if you spend a bit of time
> and write a mapreduce job to do the insert (If you want a sample, write
> the list again and I'll put something together).
> St.Ack
>
> ANKUR GOEL wrote:
> > Hi Folks,
> >             I have a table with the following column families in the
> > schema
> >        {"referer_id:", "100"},  (Integer here is max length)
> >        {"url:","1500"},
> >        {"site:","500"},
> >        {"status:","100"}
> >
> > The common attributes for all the above column families are [max
> > versions: 1,  compression: NONE, in memory: false, block cache
> > enabled: true, max length: 100, bloom filter: none]
> >
> > [HBase Configuration]:
> >   - HDFS runs on 10 machine nodes with 8 GB RAM each and 4 CPU cores.
> >   - HMaster runs on a different machine than NameNode.
> >   - There are 9 regionservers configured
> >   - Total DFS available  = 150 GB.
> >   - LAN speed is 100 Mbps
> >
> > I am trying to insert approx 4.8 million rows and the speed that I get
> > is around 1500 row inserts per sec (100,000 row inserts per min.).
> >
> > It takes around 50 min to insert all the seeds. The Java program that
> > does the inserts uses buffered I/O to read the data from a local
> > file and runs on the same machine as the HMaster. To give you an idea
> > of Java code that does the insert here is a snapshot of the loop.
> >
> > while ((url = seedReader.readLine()) != null) {
> >      try {
> >        BatchUpdate update = new BatchUpdate(new Text(md5(normalizedUrl)));
> >        update.put(new Text("url:"), getBytes(url));
> >        update.put(new Text("site:"), getBytes(new URL(url).getHost()));
> >        update.put(new Text("status:"), getBytes(status));
> >        seedlist.commit(update); // seedlist is the HTable
> >       }
> > ....
> > ....
> >
> > Is there a way to tune HBase to achieve better I/O speeds?
> > Ideally I would like to reduce the total insert time to less than 15
> > min, i.e. achieve an insert speed of around 4500 rows/sec or more.
> >
> > Thanks
> > -Ankur
> >
> >
>
>
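
On the quoted advice to multithread the insert and randomize the keys, a rough
sketch of how the two could fit together is below.  RowSink is a hypothetical
stand-in for the table commit call (not the HBase API), md5Hex is only one
guess at what the md5() helper in the quoted loop might do, and the thread
count is an arbitrary choice.  It assumes the sink is safe to call from
several threads at once.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the table commit call; NOT the real HBase API.
// Implementations are assumed to be thread-safe.
interface RowSink {
    void commit(String rowKey, String url);
}

class ParallelInsertSketch {
    // Hex-encoded MD5 of the input, so similar URLs get unrelated row keys.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Hands one commit per URL to a fixed-size pool and waits for completion.
    static void insertAll(List<String> urls, RowSink sink, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            pool.execute(() -> sink.commit(md5Hex(url), url));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("inserts did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        AtomicInteger committed = new AtomicInteger();
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");
        insertAll(urls, (key, url) -> committed.incrementAndGet(), 4);
        System.out.println("rows committed: " + committed.get());
    }
}
```

Hashing the key spreads writes over the key space, which is what lets several
regionservers share the load, at the cost of losing any natural ordering of
the URLs.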
