hbase-user mailing list archives

From 茅旭峰 <m9s...@gmail.com>
Subject One of the regionserver aborted, then the master shut down itself
Date Mon, 14 Mar 2011 13:43:28 GMT
Hi,

We are trying out a small HBase environment: 2 master nodes, 4 regionserver
nodes, and a 3-node ZooKeeper cluster.
We are now running stress tests with large cells (4 MB to 15 MB); a rough
sketch of the writer is below, followed by what we ran into.
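(Simplified sketch of the writer: the real row keys and error handling are
stripped out, and the family/qualifier names here are placeholders rather
than our actual schema.)

====
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

import java.util.Random;

public class BigCellWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "table2");
        byte[] family = Bytes.toBytes("f");     // placeholder family name
        byte[] qualifier = Bytes.toBytes("q");  // placeholder qualifier
        Random rnd = new Random();
        for (int i = 0; i < 100000; i++) {
            // cell sizes between 4 MB and 15 MB, like the real test
            byte[] value = new byte[(4 + rnd.nextInt(12)) * 1024 * 1024];
            rnd.nextBytes(value);
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.add(family, qualifier, value);
            table.put(put);
        }
        table.close();
    }
}
====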

1. One of the regionservers (10.241.67.33) aborted due to OOM, with this log:
====
2011-03-14 20:35:28,235 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Starting compaction on region
table2,SXGdRIcvExFco4zZIPwhcKjjYlo=,1300096043004.8b00c3cb40e3e827f46aaaeeafce044d.
2011-03-14 20:35:28,236 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up
with memory above low water.
2011-03-14 20:35:28,236 ERROR
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flusher failed
for entry
org.apache.hadoop.hbase.regionserver.MemStoreFlusher$WakeupFlushThread@2ca74909

......

2011-03-14 20:36:40,396 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up
with memory above low water.
2011-03-14 20:36:40,396 ERROR
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flusher failed
for entry
org.apache.hadoop.hbase.regionserver.MemStoreFlusher$WakeupFlushThread@3b97477e
......
2011-03-14 20:41:36,541 INFO org.apache.hadoop.hbase.regionserver.HRegion:
aborted compaction on region
table2,Ig92Q4M9wK3GagLmUMDhHvlQBr0=,1300097018887.62a5a22f09f37e099679b2f6a12490b6.
after 1mins, 39sec
2011-03-14 20:41:36,541 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: Received close for
region we are not serving; 944c234d8f64eccd0833d86349623309
2011-03-14 20:41:36,544 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
serverName=cloud140,60020,1299747751421, load=(requests=0, regions=255,
usedHeap=3982, maxHeap=3983): Uncaught exception in service thread
regionserver60020.compactor
java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39)
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
        at
org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1093)
        at
org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
        at
org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1276)
        at
org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:87)
        at
org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:82)
        at
org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:262)
        at
org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:326)
        at
org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:927)
        at
org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:733)
        at
org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:769)
        at
org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:714)
        at
org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
......
2011-03-14 20:41:38,067 INFO org.apache.hadoop.hbase.regionserver.Leases:
regionserver60020 closing leases
2011-03-14 20:41:38,067 INFO org.apache.hadoop.hbase.regionserver.Leases:
regionserver60020 closed leases
2011-03-14 20:41:38,085 INFO
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Closed zookeeper sessionid=0x32e9a2f2678000a
2011-03-14 20:41:38,093 INFO org.apache.zookeeper.ZooKeeper: Session:
0x32e9a2f2678000a closed
2011-03-14 20:41:38,093 INFO org.apache.zookeeper.ClientCnxn: EventThread
shut down
2011-03-14 20:41:38,100 INFO org.apache.zookeeper.ZooKeeper: Session:
0x12e9a2e21be0011 closed
2011-03-14 20:41:38,101 INFO org.apache.zookeeper.ClientCnxn: EventThread
shut down
2011-03-14 20:41:38,102 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020
exiting
2011-03-14 20:41:38,117 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting;
hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-14,5,main]
2011-03-14 20:41:38,117 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown hook
2011-03-14 20:41:38,117 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook
thread.
2011-03-14 20:41:38,220 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.

====
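A rough back-of-the-envelope on why compaction runs out of memory with cells
this size (the heap size is taken from the abort message above; everything
else is my assumption, not something I measured):

====
public class CompactionHeapEstimate {
    public static void main(String[] args) {
        // Assumption: each store file scanner in a compaction holds one
        // uncompressed HFile block on the heap, and with 15 MB cells every
        // block is at least one full cell.
        long maxCellBytes = 15L * 1024 * 1024;
        int storeFilesPerCompaction = 10;            // guess
        long memstoreFlushSize = 64L * 1024 * 1024;  // default flush size
        int concurrentFlushes = 5;                   // guess

        long compactionBuffers = maxCellBytes * storeFilesPerCompaction;
        long flushPressure = memstoreFlushSize * concurrentFlushes;

        System.out.printf("compaction read buffers ~ %d MB%n",
                compactionBuffers >> 20);
        System.out.printf("memstore flush pressure ~ %d MB%n",
                flushPressure >> 20);
        // Against a 3983 MB heap (usedHeap=3982 at abort time), this leaves
        // very little headroom once the block cache and memstores fill up.
    }
}
====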

2. The master shut itself down, with this log:

====
2011-03-14 20:41:03,991 DEBUG
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Cache hit for row <> in tableName .META.: location server cloud134:60020,
location region name .META.,,1.1028785192
2011-03-14 20:41:35,415 FATAL org.apache.hadoop.hbase.master.HMaster: Remote
unexpected exception
java.io.IOException: Call to /10.241.67.33:60020 failed on local exception:
java.io.EOFException
        at
org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:806)
        at
org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
        at
org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        at $Proxy6.closeRegion(Unknown Source)
        at
org.apache.hadoop.hbase.master.ServerManager.sendRegionClose(ServerManager.java:589)
        at
org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1093)
        at
org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor.chore(AssignmentManager.java:1672)
        at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:539)
        at
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477)
2011-03-14 20:41:35,417 INFO org.apache.hadoop.hbase.master.HMaster:
Aborting
2011-03-14 20:41:35,421 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Failed connect to
serverName=cloud140,60020,1299747751421, load=(requests=0, regions=0,
usedHeap=0, maxHeap=0), message=Connection refused,
region=1bd40fb0466d9b0b52f93f9af41b3a8e
2011-03-14 20:41:35,424 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Failed connect to
serverName=cloud140,60020,1299747751421, load=(requests=0, regions=0,
usedHeap=0, maxHeap=0), message=Connection refused,
region=1c564f7c72355b78369aa1467d9852e5
2011-03-14 20:41:35,425 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Failed connect to
serverName=cloud140,60020,1299747751421, load=(requests=0, regions=0,
usedHeap=0, maxHeap=0), message=Connection refused,
region=1cabffd7a9213938ac1181f05c19f2d7
.....
2011-03-14 20:41:36,013 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Failed connect to
serverName=cloud140,60020,1299747751421, load=(requests=0, regions=0,
usedHeap=0, maxHeap=0), message=Connection refused,
region=f694004eeb2dea68a65ec3b378764805
2011-03-14 20:41:36,017 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Failed connect to
serverName=cloud140,60020,1299747751421, load=(requests=0, regions=0,
usedHeap=0, maxHeap=0), message=Connection refused,
region=fa3f2e8c175d794aa8b18e3e8995d382
2011-03-14 20:41:36,025 INFO
org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor:
cloud135:60000.timeoutMonitor exiting
2011-03-14 20:41:36,093 DEBUG org.apache.hadoop.hbase.master.HMaster:
Stopping service threads
2011-03-14 20:41:36,093 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
server on 60000
2011-03-14 20:41:36,093 INFO org.apache.hadoop.hbase.master.CatalogJanitor:
cloud135:60000-CatalogJanitor exiting
2011-03-14 20:41:36,094 INFO org.apache.hadoop.hbase.master.HMaster$1:
cloud135:60000-BalancerChore exiting
2011-03-14 20:41:36,094 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC
Server listener on 60000
2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 0 on 60000: exiting
2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 4 on 60000: exiting
2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC
Server Responder
2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 3 on 60000: exiting
2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 1 on 60000: exiting
2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 8 on 60000: exiting
2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 7 on 60000: exiting
2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 5 on 60000: exiting
2011-03-14 20:41:36,095 INFO org.apache.hadoop.hbase.master.HMaster:
Stopping infoServer
2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 2 on 60000: exiting
2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 6 on 60000: exiting
2011-03-14 20:41:36,096 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 9 on 60000: exiting
2011-03-14 20:41:36,095 INFO org.apache.hadoop.hbase.master.LogCleaner:
master-cloud135:60000.oldLogCleaner exiting
2011-03-14 20:41:36,096 INFO org.mortbay.log: Stopped
SelectChannelConnector@0.0.0.0:60010
2011-03-14 20:41:36,222 DEBUG
org.apache.hadoop.hbase.catalog.CatalogTracker: Stopping catalog tracker
org.apache.hadoop.hbase.catalog.CatalogTracker@62770d2e
2011-03-14 20:41:36,323 INFO
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Closed zookeeper sessionid=0x12e9a2e21be0038
2011-03-14 20:41:36,328 INFO org.apache.zookeeper.ClientCnxn: EventThread
shut down
2011-03-14 20:41:36,328 INFO org.apache.zookeeper.ZooKeeper: Session:
0x12e9a2e21be0038 closed
2011-03-14 20:41:36,340 INFO org.apache.zookeeper.ClientCnxn: EventThread
shut down
2011-03-14 20:41:36,340 INFO org.apache.zookeeper.ZooKeeper: Session:
0x22e9a2e22390033 closed
2011-03-14 20:41:36,340 INFO org.apache.hadoop.hbase.master.HMaster: HMaster
main thread exiting

====

The regionserver that died is cloud140 (10.241.67.33). What confuses us is
why the death of a regionserver would cause the master to shut itself down.
On the other hand, the standby master did start, but we still could not
manipulate the table because there were too many WALs to recover.
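For what it's worth, the number of WALs waiting to be split can be checked
with something like the following (this assumes the default hbase.rootdir
layout with WALs under /hbase/.logs, one subdirectory per regionserver, so
the path may need adjusting):

====
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CountWals {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        Path logsDir = new Path("/hbase/.logs");   // default WAL directory
        long total = 0;
        for (FileStatus serverDir : fs.listStatus(logsDir)) {
            FileStatus[] wals = fs.listStatus(serverDir.getPath());
            System.out.println(serverDir.getPath().getName()
                    + ": " + wals.length + " WAL(s)");
            total += wals.length;
        }
        System.out.println("total WALs pending split: " + total);
    }
}
====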

I suspect this relates to the previous thread 'HMaster startup is very slow,
and always run into out-of-memory issue' that I submitted.

I also suspect both issues come down to GC tuning, but the only guideline I
can find is to use the CMS collector. Are there any others, particularly for
larger cells?
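For reference, the regionserver-side knobs I am thinking about for big cells
look like the sketch below. The values are only guesses to illustrate the
direction, not settings we have validated, and I am not sure
hbase.client.keyvalue.maxsize exists in our version:

====
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class BigCellTuningSketch {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();

        // Smaller block cache leaves more heap for flush/compaction buffers.
        conf.setFloat("hfile.block.cache.size", 0.15f);

        // Flush memstores earlier so one flush never holds too many 15 MB cells.
        conf.setLong("hbase.hregion.memstore.flush.size", 64L * 1024 * 1024);

        // Keep the global memstore ceiling modest relative to the heap.
        conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.35f);
        conf.setFloat("hbase.regionserver.global.memstore.lowerLimit", 0.30f);

        // Reject truly oversized cells at the client, if this property is
        // available in our version.
        conf.setLong("hbase.client.keyvalue.maxsize", 16L * 1024 * 1024);

        System.out.println("block cache fraction = "
                + conf.getFloat("hfile.block.cache.size", -1f));
    }
}
====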

Thanks and regards,

Mao Xu-Feng
