hbase-user mailing list archives

From: Jim Kellerman <...@powerset.com>
Subject: RE: HBase performance tuning
Date: Fri, 28 Mar 2008 15:52:14 GMT
I want to encourage all users to publish their results on the HBase Wiki, even if you are only
'kicking the tires' of HBase. One place to do this is on the PoweredBy page. We are interested
in seeing your cluster configuration, what version of HBase you are running, your schema (if
that is something you can share) and brief performance numbers you have achieved using Map/Reduce
or multi-threaded clients. If you are using the REST or Thrift APIs rather than the
native Java API, that would be of interest as well.

If you do a detailed performance test, feel free to create a new page on the Wiki and point
to it from the front page. When we have a number of performance links we will probably create
a new page and move the links there, but for now such a page would only have a couple of links
on it.

Our 'theme' for release 0.2.0 is robustness and scalability, but we plan to address performance
in 0.3.0. We have picked most of the low-hanging fruit in performance, but if anyone would
like to do more detailed analysis and contribute patches related to hot spots they find, we
would welcome that as well.

We don't have many contributors at this time, and would welcome outside contributions as there
is only so much we can do.

Making HBase do what you need it to do is foremost in our minds, which is why we have chosen
to work on robustness and scalability in the short term (if it falls over when you try to
use it, it really doesn't matter how fast it is).

Thanks for your patience and willingness to try HBase even though it is still immature (only
about 2 person-years of effort have been invested in it to date).

---
Jim Kellerman, Senior Engineer; Powerset


> -----Original Message-----
> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> Sent: Friday, March 28, 2008 5:09 AM
> To: hbase-user@hadoop.apache.org
> Subject: RE: HBase performance tuning
>
> OK, so I picked up and modified the code for my use and tried it
> with different configurations, varying the number of reducers in
> each run (10, 20, 40, 80, 200). The best throughput I could get
> (with 200 reducers) was 4306 inserts/sec, with a total runtime of
> 17 min. for 4.38 million seeds.
>
> Using my threaded client running 200 threads, I managed the same
> number of inserts in 12 min.
>
> It looks like the Map-Reduce insert is slower than our regular
> threaded insert. Can we gain performance via any other tweak?
> If not, is there any reasonable scope for improving HBase
> performance via code optimization?
>
> (I wouldn't mind taking a deep dive into the code to optimize core
> HBase memory structures and contribute to HBase.)
>
> Thanks
> -Ankur
>
>
>
> -----Original Message-----
> From: stack [mailto:stack@duboce.net]
> Sent: Thursday, March 27, 2008 12:05 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase performance tuning
>
> I just posted EXAMPLE code to the hbase MR wiki page:
> http://wiki.apache.org/hadoop/Hbase/MapReduce
> St.Ack
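
The example stack mentions lives on the wiki page linked above and is not
reproduced in this message. As a rough, hypothetical sketch of that kind of
job, the following map-only program has each mapper open its own HTable and
commit one BatchUpdate per input URL, using the same client calls
(BatchUpdate, put, commit) that appear in the snippet quoted further down
the thread. The table name "seedlist", the column names, the "SEED" status
value, and the HTable constructor and import paths are assumptions; client
signatures changed between the early HBase releases, so treat this as an
illustration rather than the code actually posted on the wiki.

import java.io.IOException;
import java.net.URL;
import java.security.MessageDigest;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.NullOutputFormat;
// HBase client classes as used in the quoted snippet below; package
// locations and constructor signatures varied across early releases.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

public class SeedUploadJob {

  // Map-only job: each mapper opens its own HTable and commits one
  // BatchUpdate per input line, so writes are spread over the cluster
  // by the task scheduler rather than by client threads.
  public static class UploadMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private HTable table;

    public void configure(JobConf job) {
      try {
        // "seedlist" is the table name assumed from this thread.
        table = new HTable(new HBaseConfiguration(), new Text("seedlist"));
      } catch (IOException e) {
        throw new RuntimeException("Could not open HTable", e);
      }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      String url = value.toString().trim();
      if (url.length() == 0) {
        return;
      }
      // md5 of the URL as the row key keeps inserts spread over regions.
      BatchUpdate update = new BatchUpdate(new Text(md5(url)));
      update.put(new Text("url:"), url.getBytes());
      update.put(new Text("site:"), new URL(url).getHost().getBytes());
      update.put(new Text("status:"), "SEED".getBytes()); // placeholder status
      table.commit(update);
    }
  }

  private static String md5(String s) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes());
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(SeedUploadJob.class);
    conf.setJobName("seed-upload");
    conf.setNumReduceTasks(0);                      // map-only
    conf.setMapperClass(UploadMapper.class);
    conf.setOutputFormat(NullOutputFormat.class);   // all output goes to HBase
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    JobClient.runJob(conf);
  }
}

Launched with the seed file (or directory) path as its single argument; the
degree of write parallelism is then governed by the number of map tasks
rather than by client threads.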
>
>
>
>
> Naama Kraus wrote:
> > Hi,
> >
> > A sample MapReduce for an insert would be interesting to me, too!
> >
> > Naama
> >
> > On Tue, Mar 25, 2008 at 3:54 PM, stack <stack@duboce.net> wrote:
> >
> >
> >> Your insert is single-threaded?  At a minimum your program should
> >> be multithreaded.  Randomize the keys on your data so that the
> >> inserts are spread across your 9 regionservers.  Better if you
> >> spend a bit of time and write a mapreduce job to do the insert (if
> >> you want a sample, write the list again and I'll put something
> >> together).
> >> St.Ack
> >>
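
A minimal sketch of the multithreaded client stack describes above (not code
from this thread): a producer reads the seed file while a fixed pool of
workers, each holding its own HTable, drains a shared queue and commits one
BatchUpdate per row, keyed by an md5 of the URL so inserts spread across the
region servers. The table name "seedlist", the column names, the "SEED"
status value, and the HTable constructor are assumptions carried over from
the snippet quoted below; adjust them to the client API of the release in use.

import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URL;
import java.security.MessageDigest;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.Text;
// Package locations varied across early HBase releases.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

public class ThreadedSeedLoader {

  private static final String POISON = "";   // end-of-input marker

  public static void main(String[] args) throws Exception {
    final int threads = 200;                 // matches the figure reported above
    final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(10000);
    ExecutorService pool = Executors.newFixedThreadPool(threads);

    // One worker per thread; each opens its own HTable so commits do not
    // serialize on a single client instance.
    for (int i = 0; i < threads; i++) {
      pool.submit(new Runnable() {
        public void run() {
          try {
            HTable table =
                new HTable(new HBaseConfiguration(), new Text("seedlist"));
            String url;
            while (!(url = queue.take()).equals(POISON)) {
              BatchUpdate update = new BatchUpdate(new Text(md5(url)));
              update.put(new Text("url:"), url.getBytes());
              update.put(new Text("site:"), new URL(url).getHost().getBytes());
              update.put(new Text("status:"), "SEED".getBytes());
              table.commit(update);
            }
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }

    // Producer: read the seed file and feed the queue; skip blank lines so
    // they are not mistaken for the end-of-input marker.
    BufferedReader reader = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = reader.readLine()) != null) {
      String url = line.trim();
      if (url.length() > 0) {
        queue.put(url);
      }
    }
    reader.close();
    for (int i = 0; i < threads; i++) {
      queue.put(POISON);                     // one marker per worker
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }

  private static String md5(String s) throws Exception {
    byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes());
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }
}

Run with the seed file path as the only argument; hashing the row key is
what spreads the load, so the same idea applies whether the writers are
client threads or map tasks.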
> >> ANKUR GOEL wrote:
> >>
> >>> Hi Folks,
> >>>             I have a table with the following column families in
> >>> the schema:
> >>>        {"referer_id:", "100"},  (the integer here is the max length)
> >>>        {"url:", "1500"},
> >>>        {"site:", "500"},
> >>>        {"status:", "100"}
> >>>
> >>> The common attributes for all the above column families are [max
> >>> versions: 1,  compression: NONE, in memory: false, block cache
> >>> enabled: true, max length: 100, bloom filter: none]
> >>>
> >>> [HBase Configuration]:
> >>>   - HDFS runs on 10 machine nodes with 8 GB RAM each and 4 CPU cores.
> >>>   - HMaster runs on a different machine than the NameNode.
> >>>   - There are 9 regionservers configured.
> >>>   - Total DFS available = 150 GB.
> >>>   - LAN speed is 100 Mbps.
> >>>
> >>> I am trying to insert approx 4.8 million rows and the speed that
> >>> I get is around 1500 row inserts per sec (100,000 row inserts per
> >>> min.).
> >>>
> >>> It takes around 50 min to insert all the seeds. The Java program
> >>> that does the inserts uses buffered I/O to read the data from a
> >>> local file and runs on the same machine as the HMaster. To give
> >>> you an idea of the Java code that does the insert, here is a
> >>> snapshot of the loop.
> >>>
> >>> while ((url = seedReader.readLine()) != null) {
> >>>     try {
> >>>         BatchUpdate update = new BatchUpdate(new Text(md5(normalizedUrl)));
> >>>         update.put(new Text("url:"), getBytes(url));
> >>>         update.put(new Text("site:"), getBytes(new URL(url).getHost()));
> >>>         update.put(new Text("status:"), getBytes(status));
> >>>         seedlist.commit(update); // seedlist is the HTable
> >>>     }
> >>> ....
> >>> ....
> >>>
> >>> Is there a way to tune HBase to achieve better I/O speeds?
> >>> Ideally I would like to reduce the total insert time to less than
> >>> 15 min, i.e. achieve an insert speed of around 4500 rows/sec or
> >>> more.
> >>>
> >>> Thanks
> >>> -Ankur
> >>>
> >>>
> >>>
> >>
> >
> >
> >
>
>


