hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexander Batyrshin <0x62...@gmail.com>
Subject hbase:meta not online
Date Tue, 13 Aug 2019 17:11:01 GMT
 Hello,
We’ve got problem with hbase:meta that cause several servers to die.
Our version is HBase-1.4.8

prod006 - flush hbase:meta and move it: 

Aug 13 15:46:23 prod006 hbase[63435]: 2019-08-13 15:46:23,019 INFO  [prod006,60020,1564942121270_ChoreService_1]
regionserver.HRegionServer: prod006,60020,1564942121270-MemstoreFlusherChore requesting flush
of hbase:meta,,1.1588230740 because info has an old edit so flush to free WALs after random
delay 226,666 ms
Aug 13 15:50:09 prod006 hbase[63435]: 2019-08-13 15:50:09,709 INFO  [MemStoreFlusher.1] regionserver.HRegion:
Finished memstore flush of ~2.99 KB/3064, currentsize=0 B/0 for region hbase:meta,,1.1588230740
in 24ms, sequenceid=323883, compaction requested=false
Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,943 INFO  [StoreCloserThread-hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.-1]
regionserver.HStore: Closed info
Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,951 INFO  [StoreCloserThread-hbase:meta,,1.1588230740-1]
regionserver.HStore: Closed info
Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,976 INFO  [RS_CLOSE_META-prod006:60020-0]
regionserver.HRegion: Closed hbase:meta,,1.1588230740
Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,979 INFO  [RS_CLOSE_REGION-prod006:60020-0]
regionserver.HRegion: Closed hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.
Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,979 INFO  [RS_CLOSE_REGION-prod006:60020-0]
regionserver.HRegionServer: Adding moved region record: d3d9ad37f8506c02a84458121965c0b0 to
prod023,60020,1565701089060 as of 147

At this time at master we have this:

Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO  [AM.ZK.Worker-pool5-t1878]
master.RegionStates: Transition {1588230740 state=PENDING_CLOSE, ts=1565701090850, server=prod006,60020,1564942121270}
to {158823074
0 state=CLOSED, ts=1565701090982, server=prod006,60020,1564942121270}
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO  [AM.-pool3-t178] master.AssignmentManager:
Setting node as OFFLINED in ZooKeeper for region {ENCODED => 1588230740, NAME => 'hbase:meta,,1',
STARTKEY => '', ENDKEY => ''}
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO  [AM.-pool3-t178] master.RegionStates:
Transition {1588230740 state=CLOSED, ts=1565701090982, server=prod006,60020,1564942121270}
to {1588230740 state=OFFLINE, ts=1565701090982, server=prod006,60020,1564942121270}
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO  [AM.ZK.Worker-pool5-t1879]
master.RegionStates: Transition {d3d9ad37f8506c02a84458121965c0b0 state=PENDING_CLOSE, ts=1565701090871,
server=prod006,60020,1564942121270} to {d3d9ad37f8506c02a84458121965c0b0 state=CLOSED, ts=1565701090984,
server=prod006,60020,1564942121270}
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO  [AM.-pool3-t178] master.AssignmentManager:
Assigning hbase:meta,,1.1588230740 to prod023,60020,1565701089060
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO  [AM.-pool3-t178] master.RegionStates:
Transition {1588230740 state=OFFLINE, ts=1565701090982, server=prod006,60020,1564942121270}
to {1588230740 state=PENDING_OPEN, ts=1565701090984, server=prod023,60020,1565701089060}
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,985 INFO  [AM.-pool3-t179] master.AssignmentManager:
Setting node as OFFLINED in ZooKeeper for region {ENCODED => d3d9ad37f8506c02a84458121965c0b0,
NAME => 'hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.', STARTKEY =>
'', ENDKEY => ''}
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,985 INFO  [AM.-pool3-t179] master.RegionStates:
Transition {d3d9ad37f8506c02a84458121965c0b0 state=CLOSED, ts=1565701090984, server=prod006,60020,1564942121270}
to {d3d9ad37f8506c02a84458121965c0b0 state=OFFLINE, ts=1565701090985, server=prod006,60020,1564942121270}
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,986 INFO  [AM.-pool3-t179] master.AssignmentManager:
Assigning hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to prod023,60020,1565701089060
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,986 INFO  [AM.-pool3-t179] master.RegionStates:
Transition {d3d9ad37f8506c02a84458121965c0b0 state=OFFLINE, ts=1565701090985, server=prod006,60020,1564942121270}
to {d3d9ad37f8506c02a84458121965c0b0 state=PENDING_OPEN, ts=1565701090986, server=prod023,60020,1565701089060}
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,039 WARN  [AM.-pool3-t179] master.AssignmentManager:
Failed assignment of hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to prod023,60020,1565701089060,
waiting a little before trying on the same region server try=1 of 10
Aug 13 15:58:11 prod001 hbase[14953]: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException:
org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server prod023,60020,1565701089060
is not running yet
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1320)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1707)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22737)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Aug 13 15:58:11 prod001 hbase[14953]:         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
Aug 13 15:58:11 prod001 hbase[14953]:         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
Aug 13 15:58:11 prod001 hbase[14953]:         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
Aug 13 15:58:11 prod001 hbase[14953]:         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.instantiateException(RemoteWithExtrasException.java:95)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:85)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:368)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:330)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:787)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2248)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1694)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
Aug 13 15:58:11 prod001 hbase[14953]:         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
Aug 13 15:58:11 prod001 hbase[14953]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
Aug 13 15:58:11 prod001 hbase[14953]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Aug 13 15:58:11 prod001 hbase[14953]:         at java.lang.Thread.run(Thread.java:748)
Aug 13 15:58:11 prod001 hbase[14953]: Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.ServerNotRunningYetException):
org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server prod023,60020,1565701089060
is not running yet
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1320)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1707)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22737)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:386)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:94)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:409)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:405)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.readResponse(BlockingRpcConnection.java:600)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.run(BlockingRpcConnection.java:334)
Aug 13 15:58:11 prod001 hbase[14953]:         ... 1 more
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,217 INFO  [AM.ZK.Worker-pool5-t1882]
master.RegionStates: Transition {1588230740 state=PENDING_OPEN, ts=1565701090984, server=prod023,60020,1565701089060}
to {1588230740 state=OPENING, ts=1565701091217, server=prod023,60020,1565701089060}
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,244 INFO  [AM.-pool3-t179] master.AssignmentManager:
Assigning hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to prod023,60020,1565701089060
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,391 INFO  [AM.ZK.Worker-pool5-t1883]
master.RegionStates: Transition {d3d9ad37f8506c02a84458121965c0b0 state=PENDING_OPEN, ts=1565701091244,
server=prod023,60020,1565701089060} to {d3d9ad37f8506c02a84458121965c0b0 state=OPENING, ts=1565701091391,
server=prod023,60020,1565701089060}
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,666 INFO  [AM.ZK.Worker-pool5-t1884]
master.RegionStates: Transition {1588230740 state=OPENING, ts=1565701091217, server=prod023,60020,1565701089060}
to {1588230740 state=OPEN, ts=1565701091666, server=prod023,60020,1565701089060}
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,666 INFO  [AM.ZK.Worker-pool5-t1884]
coordination.ZkOpenRegionCoordination: Handling OPENED of 1588230740 from prod001,60000,1565179968855;
deleting unassigned node
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,668 INFO  [AM.ZK.Worker-pool5-t1886]
master.RegionStates: Offlined 1588230740 from prod006,60020,1564942121270
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,805 INFO  [AM.ZK.Worker-pool5-t1887]
master.RegionStates: Transition {d3d9ad37f8506c02a84458121965c0b0 state=OPENING, ts=1565701091391,
server=prod023,60020,1565701089060} to {d3d9ad37f8506c02a84458121965c0b0 state=OPEN, ts=1565701091805,
server=prod023,60020,1565701089060}
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,807 INFO  [AM.ZK.Worker-pool5-t1889]
master.RegionStates: Offlined d3d9ad37f8506c02a84458121965c0b0 from prod006,60020,1564942121270


From this moment hbase:meta and hbase:namespace is moved to prod023 from prod006

But some other servers still try to get hbase:meta from prod006:

Aug 13 16:02:24 prod021 hbase[103657]: 2019-08-13 16:02:24,832 INFO  [regionserver/prod021/10.73.65.28:60020-splits-1564981231292]
client.RpcRetryingCaller: Call exception, tries=10, retries=350, started=38462 ms ago, cancelled=false,
msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online
on prod006,60020,1564942121270
Aug 13 16:02:24 prod021 hbase[103657]: #011at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3086)
Aug 13 16:02:24 prod021 hbase[103657]: #011at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
Aug 13 16:02:24 prod021 hbase[103657]: #011at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:2197)
Aug 13 16:02:24 prod021 hbase[103657]: #011at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36617)
Aug 13 16:02:24 prod021 hbase[103657]: #011at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2369)
Aug 13 16:02:24 prod021 hbase[103657]: #011at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
Aug 13 16:02:24 prod021 hbase[103657]: #011at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
Aug 13 16:02:24 prod021 hbase[103657]: #011at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Aug 13 16:02:24 prod021 hbase[103657]:  row ‘IDX,\x1Abb6f7662-275e-459d-b0f3-7fd6aac9b6a4\x0000000046218537kV0WbJ_\x00\x13\x80\x00\x01k\x90\xFEzH\x00\x00\x00\x00,1562868892011.a215bcc7b396e5d08563f0a4183ba5cc.44'
on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=prod023,60020,1565701089060,
seqNum=0

….

Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,136 INFO  [regionserver/prod021/10.73.65.28:60020-splits-1564981231292]
client.RpcRetryingCaller: Call exception, tries=68, retries=350, started=1194772 ms ago, cancell
ed=false, msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not
online on prod006,60020,1564942121270
Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,146 INFO  [regionserver/prod021/10.73.65.28:60020-splits-1564981231292]
regionserver.SplitRequest: Running rollback/cleanup of failed split of IDX,\x1Abb6f7662-275e-459d-b0f3-7fd6aac9b6a4\x0000000046218537kV0WbJ_\x00\x13\x80\x00\x01k\x90\xFEzH\x00\x00\x00\x00,1562868892011.a215bcc7b396e5d08563f0a4183ba5cc.;
callTimeout=1200000, callDuration=1214938: org.apache.hadoop.hbase.NotServingRegionException:
Region hbase:meta,,1 is not online on prod006,60020,1564942121270
Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,193 FATAL [regionserver/prod021/10.73.65.28:60020-splits-1564981231292]
regionserver.HRegionServer: ABORTING region server prod021,60020,1564943354635: Abort; we
got an error after point-of-no-return

The same picture on other crashed servers


Any ideas why this happened and how to prevent this in future?
Mime
View raw message