hbase-user mailing list archives

From Jeff Whiting <je...@qualtrics.com>
Subject Struggling with Region Servers Running out of Memory
Date Mon, 29 Oct 2012 22:55:38 GMT
We have 6 region servers, each given 10G of memory for HBase. Each region server has an average of about 100 regions, and across the cluster we are averaging about 100 requests/second with a pretty even read/write load. We are running CDH4 (0.92.1-cdh4.0.1, rUnknown).

Looking over our load and our requests, I feel that 10GB of memory should be enough to handle it and that we shouldn't really be pushing the memory limits.

However, what we are seeing is that memory usage goes up slowly until the region server starts sputtering due to GC pauses; it eventually gets timed out by ZooKeeper and is killed.
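
In case it helps with diagnosis, this is roughly what we're thinking of adding to hbase-env.sh to capture the pauses in detail (these are standard HotSpot GC-logging flags; the log path is just an example for our layout):

    # hbase-env.sh -- GC logging for the region server JVM (log path is an example)
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
      -verbose:gc \
      -XX:+PrintGCDetails \
      -XX:+PrintGCDateStamps \
      -XX:+PrintGCApplicationStoppedTime \
      -Xloggc:/var/log/hbase/gc-regionserver.log"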

We'll see aborts like this in the log:
2012-10-29 08:10:52,132 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ds5.h1.ut1.qprod.net,60020,1351233245547 as dead server
2012-10-29 08:10:52,250 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2012-10-29 08:10:52,392 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547: regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf received expired from ZooKeeper, aborting
2012-10-29 08:10:52,401 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []

Which are "caused" by:
2012-10-29 08:07:40,646 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 29014ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2012-10-29 08:08:39,074 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 28121ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2012-10-29 08:09:13,261 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 31124ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2012-10-29 08:09:45,536 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 32209ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2012-10-29 08:10:18,103 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 32557ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2012-10-29 08:10:51,896 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 33741ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
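
Once the GC logging mentioned above is in place, the plan is to line these Sleeper warnings up against the stop-the-world times recorded in the GC log, something along these lines (the path and the double-digit-seconds threshold are just examples):

    # find collections whose wall-clock ("real") time ran into double-digit seconds
    grep -E 'real=[0-9]{2,}\.[0-9]+ secs' /var/log/hbase/gc-regionserver.log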


We'll also see a bunch of responseTooSlow and operationTooSlow as GC kicks in and really kills the region server's performance.


We have the JVM metrics going out to Ganglia, and looking at jvm.RegionServer.metrics.memHeapUsedM you can see that it goes up over time until the server eventually runs out of memory. I can also see in hmaster:60010/master-status that usedHeapMB just goes up, so I can make a pretty educated guess as to which server will go down next. It takes several days to a week of continuous running (after restarting a region server) before we have a potential problem.
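
For watching a suspect server in closer to real time than Ganglia gives us, I've been thinking of just polling the collector with jstat against the region server pid, along these lines (the pid placeholder and interval are examples):

    # sample old-gen occupancy (the O column, in %) plus CMS/full GC counts and times every 10s
    jstat -gcutil <regionserver-pid> 10s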

Our next one to go will probably be ds6 and jmap -heap shows:
concurrent mark-sweep generation:
    capacity = 10398531584 (9916.8125MB)
    used     = 9036165000 (8617.558479309082MB)
    free     = 1362366584 (1299.254020690918MB)
    86.89847145248619% used

So we are using 86% of the 10GB heap allocated to the concurrent mark-sweep generation. Looking at ds6 in the web interface, which shows information about the tasks and RPC calls it is handling, it doesn't show any compactions or other background tasks running, nor are there any active RPC calls longer than 0 seconds (it seems to be handling the requests just fine).

At this point I feel somewhat lost as to how to debug the problem; I'm not sure what to do next to figure out what is going on. Any suggestions as to what to look for, or how to track down where the memory is being used? I can generate heap dumps via jmap (although it effectively kills the region server), but I don't really know what to look for to see where the memory is going. I also have JMX set up on each region server and can connect to it that way.
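
For reference, the jmap invocations I've been using (or considering) look roughly like this; the :live option forces a full GC first, so even the histogram pauses the server, and the output file names are just examples:

    # class histogram -- much smaller than a full dump, but still triggers a full GC because of :live
    jmap -histo:live <regionserver-pid> > ds6-histo.txt

    # full heap dump for offline analysis (this is the one that effectively kills the region server)
    jmap -dump:live,format=b,file=ds6-heap.hprof <regionserver-pid>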

Thanks,
~Jeff

-- 
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com

