hbase-user mailing list archives

From Jeff Whiting <je...@qualtrics.com>
Subject Re: Struggling with Region Servers Running out of Memory
Date Fri, 02 Nov 2012 00:44:31 GMT
Ok, so I'm looking through the code.  It looks like HBaseServer.java creates a
replicationQueue if hbase.regionserver.replication.handler.count > 0.  We haven't changed
that, so the default of 3 applies.  The replicationQueue is then shared among the handlers.

Then in processData(byte[] buf) if it is a replication call it puts it in the replicationQueue.
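
To make the dispatch described above concrete, here is a minimal sketch of how I read it.  It is
not the actual HBaseServer code: CallDispatchSketch, Call, and dispatch are hypothetical names, and
the only things taken from the real code are the handler-count check and the idea of a separate
replication queue.  Both queues are unbounded here, which is the behaviour I suspect.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the dispatch logic; NOT the real HBaseServer implementation.
class CallDispatchSketch {
    // Hypothetical stand-in for a decoded RPC call.
    static final class Call {
        final boolean replication;
        final byte[] payload;

        Call(boolean replication, byte[] payload) {
            this.replication = replication;
            this.payload = payload;
        }
    }

    // Regular calls; unbounded, so offer() never rejects.
    final BlockingQueue<Call> callQueue = new LinkedBlockingQueue<Call>();

    // Only created when the replication handler count is > 0, otherwise null.
    final BlockingQueue<Call> replicationQueue;

    CallDispatchSketch(int replicationHandlerCount) {
        this.replicationQueue =
            replicationHandlerCount > 0 ? new LinkedBlockingQueue<Call>() : null;
    }

    // Route a call: replication calls go to the replication queue when it exists.
    void dispatch(Call call) {
        if (call.replication && replicationQueue != null) {
            replicationQueue.offer(call);
        } else {
            callQueue.offer(call);
        }
    }
}
```

With unbounded queues like these, nothing pushes back on the sender: every call that arrives is
retained until a handler gets to it.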

So when cluster A is replicating to cluster B and cluster B isn't keeping up, does the
replicationQueue just fill up until it runs out of memory?  It seems like it should rate limit
and only send new edits once the old ones have executed.  I'm a little hazy on when processData
is called and how it fits into the whole replication pipeline.
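
The kind of rate limiting I have in mind could look roughly like the sketch below.  This is an
assumption about one possible approach, not something HBase does: BoundedReplicationQueueSketch,
tryEnqueue, pollNext, and maxPendingCalls are all made-up names.  The idea is simply that a bounded
queue refuses new edits when it is full, so the sending cluster has to back off and retry.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of a bounded replication queue; NOT existing HBase code.
class BoundedReplicationQueueSketch {
    private final BlockingQueue<byte[]> queue;

    // maxPendingCalls is an assumed tuning knob, not a real HBase setting.
    BoundedReplicationQueueSketch(int maxPendingCalls) {
        this.queue = new ArrayBlockingQueue<byte[]>(maxPendingCalls);
    }

    // Returns false when the queue is full; the sender should back off and retry later.
    boolean tryEnqueue(byte[] edits) {
        return queue.offer(edits);
    }

    // Called by a handler: returns the next batch of edits, or null if none is pending.
    byte[] pollNext() {
        return queue.poll();
    }

    // Number of batches currently waiting for a handler.
    int pending() {
        return queue.size();
    }
}
```

Note that this bounds the number of pending calls, not their total size in bytes, so large
batches of edits could still use a lot of heap.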

Since the region servers are just replaying WAL logs to do the replication, it seems like the
footprint could be made very small.


On 11/1/2012 5:44 PM, Jeff Whiting wrote:
> So this is some of what I'm seeing as I go through the profiles:
> (a) 2GB - org.apache.hadoop.hbase.io.hfile.LruBlockCache
>     This looks like it is the block cache and we aren't having any problems with that...
> (b) 1.4GB - org.apache.hadoop.hbase.regionserver.HRegionServer -- 
> java.util.concurrent.ConcurrentHashMap$Segment[]
>     It looks like it belongs to the member variable "onlineRegions" which has a member
> "segments".
>     I'm guessing this is the memstores that hbase is currently holding onto.
> (c) 4.3GB -- java.util.concurrent.LinkedBlockingQueue -- 
> org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server
>    This is the one that keeps growing and not shrinking, and it is what causes us to run out
> of memory.  However, the cause isn't immediately clear in MAT like the other two.
>   These seem to be the references to the LinkedBlockingQueue (you'll need a wide monitor to
> read it well):
> Class Name                                                                                                                | Shallow Heap | Retained Heap
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
> java.util.concurrent.LinkedBlockingQueue @ 0x2aaab3583d30                                                                 |           80 | 4,616,431,568
> |- myCallQueue org.apache.hadoop.hbase.ipc.HBaseServer$Handler @ 0x2aaab50d3c70 REPL IPC Server handler 2 on 60020 Thread |          192 |       384,392
> |- myCallQueue org.apache.hadoop.hbase.ipc.HBaseServer$Handler @ 0x2aaab50d3d30 REPL IPC Server handler 1 on 60020 Thread |          192 |       384,392
> |- myCallQueue org.apache.hadoop.hbase.ipc.HBaseServer$Handler @ 0x2aaab50d3df0 REPL IPC Server handler 0 on 60020 Thread |          192 |       205,976
> |- replicationQueue org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server @ 0x2aaab392dbe0                                 |          240 |         3,968
> ---------------------------------------------------------------------------------------------------------------------------------------------------------

> So it looks like it is shared between the myCallQueue and the replicationQueue.  JProfiler
> shows the same thing.  I'm having a hard time figuring out much more.
> (d) 977MB -- In other (no common root)
>     This just seems to be other stuff going on in the region server, but I'm not really
> worried about it...as I don't think it is the culprit.
> Overall it looks like it has to do with replication.  So this cluster is in the middle of a
> replication chain A -> B -> C where this cluster is B.  So can we tell if it is running out of
> memory because it is being replicated to?  Or because it is trying to replicate somewhere
> Thanks,
> ~Jeff
> On 10/30/2012 11:39 PM, Stack wrote:
>> On Mon, Oct 29, 2012 at 3:55 PM, Jeff Whiting <jeffw@qualtrics.com> wrote:
>>> However what we are seeing is that our memory usage goes up slowly until the
>>> region server starts sputtering due to gc collection issues and it will
>>> eventually get timed out by zookeeper and be killed.
>> Hey Jeff.  You have GC logging enabled?  Might not tell you more than
>> you already know, that something is retaining more and more objects
>> over time.   You have a dumped heap?  What have you used to poke at
>> it?  You generally want to find the objects that have the deepest size
>> (Not all profilers let you do this though).  This is usually enough to
>> give you a clue.
>> Anything particular about the character of your load?  Ram asks if any
>> big cells in the mix?
>> St.Ack
>>> At this point I feel somewhat lost as to how to debug the problem. I'm not
>>> sure what to do next to figure out what is going on.  Any suggestions as to
>>> what to look for or debug where the memory is being used? I can generate
>>> heap dumps via jmap (although it effectively kills the region server) but I
>>> don't really know what to look for to see where the memory is going. I also
>>> have jmx setup on each region server and can connect to it that way.
>>> Thanks,
>>> ~Jeff
>>> -- 
>>> Jeff Whiting
>>> Qualtrics Senior Software Engineer
>>> jeffw@qualtrics.com

Jeff Whiting
Qualtrics Senior Software Engineer
