hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Garbage collection issues
Date Fri, 18 May 2012 12:19:31 GMT
Head over to Cloudera's site and look at a couple of blog posts from Todd Lipcon on HBase GC tuning.
Also look at MSLABs (MemStore-Local Allocation Buffers).

On a side note... you don't have a lot of memory to play with...
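As a rough sketch (example flags only, not tuned for your workload), the usual advice from those posts is to drop -XX:+CMSIncrementalMode, make CMS start collecting earlier, and confirm MSLAB is on (it defaults to enabled in 0.92). Something like:

```
# hbase-env.sh -- example only; heap size is a guess for an 8GB box
export HBASE_OPTS="-Xmx4g -Xmn128m -ea -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"

# hbase-site.xml, in the same key : value style as your config below
hbase.hregion.memstore.mslab.enabled : true
```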

On May 18, 2012, at 6:54 AM, Simon Kelly wrote:

> Hi
> Firstly, let me compliment the HBase team on a great piece of software. We're running
> a few clusters that are working well, but we're really struggling with a new one I'm trying
> to set up and could use some help. I have read as much as I can but just can't seem to
> get it right.
> The difference between this cluster and the others is that this one's load is 99% writes.
> Each write contains about 40 columns to a single table and column family, and the total data
> size varies between about 1 and 2 KB. The load per server varies between 20 and 90 requests
> per second at different times of the day. The row keys are UUIDs, so they are uniformly distributed
> across the (currently 60) regions.
> The problem seems to be that after some time a GC cycle takes longer than expected on one
> of the regionservers, and the master kills that regionserver.
> This morning I ran the system up until the first regionserver failure and recorded the
> data with Ganglia. I have attached the following Ganglia graphs:
> hbase.regionserver.compactionQueueSize
> hbase.regionserver.memstoreSizeMB
> requests_per_minute (to the service that calls hbase)
> request_processing_time (of the service that calls hbase)
> Any assistance would be greatly appreciated. I had GC logging on, so I have access
> to all that data too.
> Best regards
> Simon Kelly
> Cluster details
> ----------------------
> It's running on 5 machines with the following specs:
> CPUs: 4 x 2.39 GHz
> RAM: 8 GB
> Ubuntu 10.04.2 LTS
> The Hadoop cluster (version 1.0.1, r1243785) runs across all the machines and has
> 8 TB of capacity (60% unused). On top of that is HBase version 0.92.1, r1298924. All the servers
> run Hadoop datanodes and HBase regionservers. One server hosts the Hadoop primary namenode
> and the HBase master, and 3 servers form the ZooKeeper quorum.
> The HBase config is as follows:
> HBASE_OPTS="-Xmn128m -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+UseParNewGC"
> hbase.rootdir : hdfs://server1:8020/hbase
> hbase.cluster.distributed : true
> hbase.zookeeper.property.clientPort : 2222
> hbase.zookeeper.quorum : server1,server2,server3
> zookeeper.session.timeout : 30000
> hbase.regionserver.maxlogs : 16
> hbase.regionserver.handler.count : 50
> hbase.regionserver.codecs : lzo
> hbase.master.startup.retainassign : false
> hbase.hregion.majorcompaction : 0
> (For the benefit of those without the attachments, I'll describe the graphs:
> 0900 - system starts
> 1010 - memstore reaches 1.2GB and flushes to 500MB; a few HBase compactions happen and
> request_processing_time increases slightly
> 1040 - memstore reaches 1.0GB and flushes to 500MB (no HBase compactions)
> 1110 - memstore reaches 1.0GB and flushes to 300MB; a few more HBase compactions happen
> and request_processing_time increases a bit more
> 1200 - memstore reaches 1.3GB and flushes to 200MB; more HBase compactions and a further increase
> in request_processing_time
> 1230 - HBase logs for server1 record "We slept 13318ms instead of 3000ms"; regionserver1
> is killed by the master, and request_processing_time goes way up
> 1326 - HBase logs for server3 record "We slept 77377ms instead of 3000ms"; regionserver2
> is killed by the master
> )
