hbase-user mailing list archives

From Jonathan Bishop <jbishop....@gmail.com>
Subject RE: more regionservers does not improve performance
Date Fri, 12 Oct 2012 06:34:26 GMT

Thanks for the reply.

Actually, I am using MD5 hashing to spread the keys evenly among the
splits, so I don't believe there is any hotspot. In fact, when I monitor
the HBase web UI I see a very even load across all the regionservers.
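For reference, the MD5 salting I mean can be sketched with just the JDK
(the class and method names here are mine, not from my actual job): the
original row key is prefixed with its 32-character MD5 hex digest, so
lexicographically adjacent source keys scatter across the keyspace.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SaltedKey {
    // Prefix the original row key with its MD5 hex digest so that
    // adjacent source keys land in different pre-split regions.
    public static String salt(String rowKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(rowKey.getBytes(StandardCharsets.UTF_8));
            // 32-char zero-padded hex string of the 128-bit digest
            String hex = String.format("%032x", new BigInteger(1, digest));
            return hex + "-" + rowKey;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

The salt is deterministic, so the same logical row always maps to the
same physical key and can still be read back by recomputing the hash.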



From: Pankaj Misra <pankaj.misra@impetus.co.in>
Sent: Thursday, October 11, 2012 8:24:32 PM
To: user@hbase.apache.org
Subject: RE: more regionservers does not improve performance

Hi Jonathan,

It seems to me that, across all 40 mappers, the keys are not randomized
enough to leverage the multiple regions and the pre-split strategy. All
40 mappers may be writing to a single region for a while, making it a
hot region, until the keys move on to the next region, which then
becomes hot in turn. That would explain why you are seeing a heavy
impact from compaction cycles reducing your throughput.

Are the keys incremental? Are the keys randomized enough across the splits?

Ideally when all 40 mappers are running you should see all the regions
being filled up in parallel for maximum throughput. Hope it helps.
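One way to make the pre-split line up with hashed keys is to place the
region boundaries evenly over the 128-bit MD5 keyspace. This is only a
sketch (the class name is mine, and it assumes 32-character hex row
keys as in Jonathan's setup); the resulting strings would be passed as
split keys when creating the table.

```java
import java.math.BigInteger;

public class SplitPoints {
    // Evenly spaced split keys over the 128-bit MD5 keyspace, rendered
    // as 32-character hex strings; numRegions regions need numRegions-1 splits.
    public static String[] splits(int numRegions) {
        BigInteger space = BigInteger.ONE.shiftLeft(128); // 2^128
        String[] result = new String[numRegions - 1];
        for (int i = 1; i < numRegions; i++) {
            BigInteger boundary = space.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(numRegions));
            result[i - 1] = String.format("%032x", boundary);
        }
        return result;
    }
}
```

With uniformly hashed keys, each of the resulting regions should then
receive roughly 1/numRegions of the writes from the start.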

Thanks and Regards
Pankaj Misra

From: Jonathan Bishop [jbishop.rwc@gmail.com]
Sent: Friday, October 12, 2012 5:38 AM
To: user@hbase.apache.org
Subject: more regionservers does not improve performance


I am running a MR job with 40 simultaneous mappers, each of which does puts
to HBase. I have ganged up the puts into groups of 1000 (this seems to help
quite a bit) and also made sure that the table is pre-split into 100
regions, and that the row keys are randomized using MD5 hashing.
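The grouping into batches of 1000 can be sketched generically like this
(the helper is hypothetical; in the real job each batch would be handed
to a single table.put(List<Put>) call instead of one RPC per row):

```java
import java.util.ArrayList;
import java.util.List;

public class PutBatcher {
    // Split a list of items into fixed-size batches; the final batch
    // holds whatever remainder is left over.
    public static <T> List<List<T>> batches(List<T> items, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            out.add(new ArrayList<>(items.subList(i, end)));
        }
        return out;
    }
}
```

HBase's client can also buffer on its own via setAutoFlush(false) plus
setWriteBufferSize on the HTable, which amounts to the same batching
without managing the groups by hand.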

My cluster size is 10, and I am allowing 4 mappers per tasktracker.

In my MR job I know that the mappers are able to generate puts much faster
than the puts can be handled in hbase. In other words if I let the mappers
run without doing hbase puts then everything scales as you would expect
with the number of mappers created. It is the hbase puts which seem to be
the bottleneck.

What is strange is that I do not get much runtime improvement by
increasing the number of regionservers beyond about 4. Indeed, it seems
that the system runs slower with 8 regionservers than with 4.

I have added the following in hbase-env.sh hoping this would help... (from
the book HBase in Action)

export HBASE_OPTS="-Xmx8g"
export HBASE_REGIONSERVER_OPTS="-Xmx8g -Xms8g -Xmn128m -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"

# Uncomment below to enable java garbage collection logging in the .out
export HBASE_OPTS="${HBASE_OPTS} -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -Xloggc:${HBASE_HOME}/logs/gc-hbase.log"

Monitoring hbase through the web ui I see that there are pauses for
flushing, which seems to run pretty quickly, and for compacting, which
seems to take somewhat longer.
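If the compaction pauses turn out to be the bottleneck, the usual
starting points are the memstore flush size and the store-file blocking
threshold in hbase-site.xml. The values below are illustrative
assumptions only, not measured recommendations:

```xml
<!-- Illustrative values only; tune against your own workload. -->
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>268435456</value> <!-- 256 MB: fewer, larger flushes -->
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>20</value> <!-- delay the point at which writes block on compaction -->
</property>
```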

Any advice for making this run faster would be greatly appreciated.
Currently I am looking into installing Ganglia to better monitor my
cluster, but I have yet to get that running.

I suspect an I/O issue as the regionservers do not seem terribly loaded.




