hbase-user mailing list archives

From Margus Roo <mar...@roo.ee>
Subject Re: HBase master dies (1.1.2) often
Date Thu, 23 Mar 2017 07:04:20 GMT
At the same time, the ZooKeeper log shows:

2017-03-23 02:01:33,004 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x35af577e0ac0000, likely client has closed socket
         at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
         at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
         at java.lang.Thread.run(Thread.java:745)
2017-03-23 02:01:35,482 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /192.168.80.51:44456 which had sessionid 0x35af577e0ac0000
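
One thing that may be worth checking: the hbase-site.xml quoted below asks for zookeeper.session.timeout = 90000, but a ZooKeeper server only grants a timeout between minSessionTimeout and maxSessionTimeout, which default to 2 and 20 times tickTime. The sketch below (Python, illustration only) shows that clamping. The tickTime of 2000 ms is an assumption, since the zoo.cfg of this cluster is not shown in this thread, and the function name is made up for the example.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms, min_ms=None, max_ms=None):
    """Clamp a requested session timeout to the server-side bounds.

    When min_ms / max_ms are not given, the ZooKeeper defaults are used:
    2 * tickTime and 20 * tickTime.
    """
    lo = 2 * tick_time_ms if min_ms is None else min_ms
    hi = 20 * tick_time_ms if max_ms is None else max_ms
    return max(lo, min(requested_ms, hi))


requested = 90000  # zookeeper.session.timeout from the quoted hbase-site.xml
tick_time = 2000   # ASSUMPTION: tickTime is not shown anywhere in this thread

print(negotiated_session_timeout(requested, tick_time))                # 40000
print(negotiated_session_timeout(requested, tick_time, max_ms=90000))  # 90000
```

If the servers really run with those defaults, the master would get 40 seconds rather than 90, and maxSessionTimeout would have to be raised in zoo.cfg for the larger value to take effect.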


Margus (margusja) Roo
http://margus.roo.ee
skype: margusja
https://www.facebook.com/allan.tuuring
+372 51 48 780

On 23/03/2017 08:43, Ted Yu wrote:
> Have you checked the ZooKeeper logs for any clues?
>
> Cheers
>
>> On Mar 22, 2017, at 11:30 PM, Margus Roo <margus@roo.ee> wrote:
>>
>> Hi
>>
>> The HBase master shuts down almost every night. The logs show the following.
>> gc.log:
>> 2017-03-23T01:59:27.239+0200: 41752.366: [GC (Allocation Failure) 2017-03-23T01:59:27.239+0200: 41752.366: [ParNew: 159203K->11611K(166464K), 0.0115189 secs] 177260K->29669K(536512K), 0.0117362 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
>> Heap
>> par new generation   total 166464K, used 137930K [0x00000000c0000000, 0x00000000cb4a0000, 0x00000000d5550000)
>>   eden space 147968K,  85% used [0x00000000c0000000, 0x00000000c7b5b8b8, 0x00000000c9080000)
>>   from space 18496K,  62% used [0x00000000ca290000, 0x00000000cade6fa8, 0x00000000cb4a0000)
>>   to   space 18496K,   0% used [0x00000000c9080000, 0x00000000c9080000, 0x00000000ca290000)
>> concurrent mark-sweep generation total 370048K, used 18057K [0x00000000d5550000, 0x00000000ebeb0000, 0x0000000100000000)
>> Metaspace       used 55061K, capacity 56096K, committed 56400K, reserved 1099776K
>>   class space    used 5899K, capacity 6255K, committed 6264K, reserved 1048576K
>>
>>
>>
>>
>> In master.log:
>> 2017-03-23 02:02:09,178 WARN [master/nn3/192.168.80.51:16000-EventThread] client.ConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it
>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:585)
>>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
>>         at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>> 2017-03-23 02:02:10,579 FATAL [main-EventThread] master.HMaster: Master server abort: loaded coprocessors are: [org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor, org.apache.hadoop.hbase.backup.master.BackupController, org.apache.hadoop.hbase.security.visibility.VisibilityController]
>> 2017-03-23 02:02:10,857 FATAL [main-EventThread] master.HMaster: master:16000-0x15adbb9b9db078a, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, baseZNode=/hbase-unsecure master:16000-0x15adbb9b9db078a received expired from ZooKeeper, aborting
>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:585)
>>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
>>         at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>> 2017-03-23 02:02:10,090 INFO  [main-SendThread(nn3:2181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x15adbb9b9db078a has expired, closing socket connection
>> 2017-03-23 02:02:09,181 WARN [nn3:16000.activeMasterManager-EventThread] client.ConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it
>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:585)
>>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
>>         at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>> 2017-03-23 02:02:10,894 INFO [nn3:16000.activeMasterManager-EventThread] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x25adbb9ba62075d
>> 2017-03-23 02:02:10,894 INFO [nn3:16000.activeMasterManager-EventThread] zookeeper.ClientCnxn: EventThread shut down
>> 2017-03-23 02:02:10,876 INFO [master/nn3/192.168.80.51:16000-EventThread] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x25adbb9ba62075c
>> 2017-03-23 02:02:10,897 INFO [master/nn3/192.168.80.51:16000-EventThread] zookeeper.ClientCnxn: EventThread shut down
>> 2017-03-23 02:02:10,925 INFO  [main-EventThread] regionserver.HRegionServer: STOPPED: master:16000-0x15adbb9b9db078a, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, baseZNode=/hbase-unsecure master:16000-0x15adbb9b9db078a received expired from ZooKeeper, aborting
>> 2017-03-23 02:02:10,935 INFO  [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
>> 2017-03-23 02:02:11,005 INFO  [master/nn3/192.168.80.51:16000] regionserver.HRegionServer: Stopping infoServer
>> 2017-03-23 02:02:11,624 INFO [nn3,16000,1490185417271_splitLogManager__ChoreService_1] master.SplitLogManager$TimeoutMonitor: Chore: SplitLogManager Timeout Monitor was stopped
>> 2017-03-23 02:02:11,628 WARN [nn3,16000,1490185417271_ChoreService_1] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/backup-masters
>> 2017-03-23 02:02:12,104 INFO  [master/nn3/192.168.80.51:16000] mortbay.log: Stopped SelectChannelConnector@0.0.0.0:16010
>> 2017-03-23 02:02:12,286 INFO  [master/nn3/192.168.80.51:16000] procedure2.ProcedureExecutor: Stopping the procedure executor
>> 2017-03-23 02:02:12,336 INFO  [master/nn3/192.168.80.51:16000] wal.WALProcedureStore: Stopping the WAL Procedure Store
>> 2017-03-23 02:02:13,044 WARN [nn3,16000,1490185417271_ChoreService_1] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/backup-masters
>> 2017-03-23 02:02:14,497 INFO  [master/nn3/192.168.80.51:16000] regionserver.HRegionServer: stopping server nn3,16000,1490185417271
>> 2017-03-23 02:02:14,514 INFO  [master/nn3/192.168.80.51:16000] regionserver.HRegionServer: stopping server nn3,16000,1490185417271; all regions closed.
>> 2017-03-23 02:02:14,532 INFO  [master/nn3/192.168.80.51:16000] hbase.ChoreService: Chore service for: nn3,16000,1490185417271 had [[ScheduledChore: Name: CatalogJanitor-nn3:16000 Period: 300000 Unit: MILLISECONDS], [ScheduledChore: Name: LogsCleaner Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: nn3,16000,1490185417271-ExpiredMobFileCleanerChore Period: 86400 Unit: SECONDS], [ScheduledChore: Name: nn3,16000,1490185417271-MobCompactionChore Period: 604800 Unit: SECONDS], [ScheduledChore: Name: nn3,16000,1490185417271-ClusterStatusChore Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: nn3,16000,1490185417271-BalancerChore Period: 300000 Unit: MILLISECONDS], [ScheduledChore: Name: HFileCleaner Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: nn3,16000,1490185417271-RegionNormalizerChore Period: 1800000 Unit: MILLISECONDS]] on shutdown
>> 2017-03-23 02:02:14,630 INFO  [master/nn3/192.168.80.51:16000] master.MasterMobCompactionThread: Waiting for Mob Compaction Thread to finish...
>> 2017-03-23 02:02:14,644 INFO  [master/nn3/192.168.80.51:16000] master.MasterMobCompactionThread: Waiting for Region Server Mob Compaction Thread to finish...
>> 2017-03-23 02:02:14,671 WARN  [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> 2017-03-23 02:02:15,684 WARN  [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> 2017-03-23 02:02:17,684 WARN  [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> 2017-03-23 02:02:21,685 WARN  [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> 2017-03-23 02:02:29,685 WARN  [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> 2017-03-23 02:02:45,686 WARN  [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> 2017-03-23 02:03:17,686 WARN  [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> 2017-03-23 02:04:21,686 WARN  [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> 2017-03-23 02:04:21,687 ERROR [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 7 attempts
>> 2017-03-23 02:04:21,687 WARN  [master/nn3/192.168.80.51:16000] zookeeper.ZKUtil: master:16000-0x15adbb9b9db078a, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, baseZNode=/hbase-unsecure Unable to get data of znode /hbase-unsecure/master
>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> ...
>>
>>
>>
>>
>>
>> hbase-site.xml:
>>   <configuration>
>>
>>     <property>
>>       <name>dfs.client.read.shortcircuit</name>
>>       <value>true</value>
>>     </property>
>>
>>     <property>
>>       <name>dfs.domain.socket.path</name>
>>       <value>/var/lib/hadoop-hdfs/dn_socket</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.bulkload.staging.dir</name>
>>       <value>/apps/hbase/staging</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.client.keyvalue.maxsize</name>
>>       <value>1048576</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.client.retries.number</name>
>>       <value>35</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.client.scanner.caching</name>
>>       <value>100</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.client.scanner.timeout.period</name>
>>       <value>600000</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.cluster.distributed</name>
>>       <value>true</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.coprocessor.master.classes</name>
>>       <value>org.apache.hadoop.hbase.security.visibility.VisibilityController,org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.coprocessor.region.classes</name>
>>       <value>org.apache.hadoop.hbase.security.visibility.VisibilityController,org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.coprocessor.regionserver.classes</name>
>>       <value>org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor</value>
>>     </property>
>>     <property>
>>       <name>hbase.hregion.majorcompaction</name>
>>       <value>604800000</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.hregion.majorcompaction.jitter</name>
>>       <value>0.50</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.hregion.max.filesize</name>
>>       <value>10737418240</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.hregion.memstore.block.multiplier</name>
>>       <value>4</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.hregion.memstore.flush.size</name>
>>       <value>134217728</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.hregion.memstore.mslab.enabled</name>
>>       <value>true</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.hstore.blockingStoreFiles</name>
>>       <value>10</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.hstore.compaction.max</name>
>>       <value>10</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.hstore.compactionThreshold</name>
>>       <value>3</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.local.dir</name>
>>       <value>${hbase.tmp.dir}/local</value>
>>     </property>
>>     <property>
>>       <name>hbase.master.info.bindAddress</name>
>>       <value>0.0.0.0</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.master.info.port</name>
>>       <value>16010</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.master.loadbalance.bytable</name>
>>       <value>true</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.master.port</name>
>>       <value>16000</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.master.ui.readonly</name>
>>       <value>false</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.regionserver.global.memstore.size</name>
>>       <value>0.4</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.regionserver.handler.count</name>
>>       <value>30</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.regionserver.info.port</name>
>>       <value>16030</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.regionserver.port</name>
>>       <value>16020</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.regionserver.wal.codec</name>
>>       <value>org.apache.hadoop.hbase.regionserver.wal.WALCellCodec</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.rootdir</name>
>>       <value>hdfs://nn3:8020/apps/hbase/data</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.rpc.protection</name>
>>       <value>authentication</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.rpc.timeout</name>
>>       <value>90000</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.security.authentication</name>
>>       <value>simple</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.security.authorization</name>
>>       <value>true</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.superuser</name>
>>       <value>hbase</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.tmp.dir</name>
>>       <value>/tmp/hbase-${user.name}</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.zookeeper.property.clientPort</name>
>>       <value>2181</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.zookeeper.quorum</name>
>>       <value>bigdata33,bigdata36,nn3</value>
>>     </property>
>>
>>     <property>
>>       <name>hbase.zookeeper.useMulti</name>
>>       <value>true</value>
>>     </property>
>>
>>     <property>
>>       <name>hfile.block.cache.size</name>
>>       <value>0.4</value>
>>     </property>
>>
>>     <property>
>>       <name>hfile.format.version</name>
>>       <value>3</value>
>>     </property>
>>
>>     <property>
>>       <name>phoenix.query.timeoutMs</name>
>>       <value>60000</value>
>>     </property>
>>
>>     <property>
>>       <name>replication.executor.workers</name>
>>       <value>2</value>
>>     </property>
>>
>>     <property>
>>       <name>replication.sleep.before.failover</name>
>>       <value>60000</value>
>>     </property>
>>
>>     <property>
>>       <name>zookeeper.recovery.retry</name>
>>       <value>6</value>
>>     </property>
>>
>>     <property>
>>       <name>zookeeper.session.timeout</name>
>>       <value>90000</value>
>>     </property>
>>
>>     <property>
>>       <name>zookeeper.znode.parent</name>
>>       <value>/hbase-unsecure</value>
>>     </property>
>>
>>     <property>
>>       <name>zookeeper.znode.replication</name>
>>       <value>replication</value>
>>     </property>
>>
>>     <property>
>>       <name>zookeeper.znode.replication.peers</name>
>>       <value>peers</value>
>>     </property>
>>
>>     <property>
>>       <name>zookeeper.znode.replication.peers.state</name>
>>       <value>peer-state</value>
>>     </property>
>>
>>     <property>
>>       <name>zookeeper.znode.replication.rs</name>
>>       <value>rs</value>
>>     </property>
>>
>>   </configuration>
>>
>> Any hints?
>>
>> -- 
>> Margus (margusja) Roo
>> http://margus.roo.ee
>> skype: margusja
>> https://www.facebook.com/allan.tuuring
>> +372 51 48 780
>>
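
A note on the retries against /hbase-unsecure/master in the master.log quoted above: the gaps between the WARN lines roughly double each time, which looks like exponential backoff. A small Python sketch, using only the timestamps copied from that log, makes the pattern visible. The helper name gaps_seconds is made up for the example.

```python
from datetime import datetime

# Timestamps of the "Session expired for /hbase-unsecure/master" WARN lines,
# copied from the quoted master.log.
stamps = [
    "2017-03-23 02:02:14,671",
    "2017-03-23 02:02:15,684",
    "2017-03-23 02:02:17,684",
    "2017-03-23 02:02:21,685",
    "2017-03-23 02:02:29,685",
    "2017-03-23 02:02:45,686",
    "2017-03-23 02:03:17,686",
    "2017-03-23 02:04:21,686",
]


def gaps_seconds(timestamps):
    """Return the gap between consecutive log timestamps, rounded to whole seconds."""
    parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S,%f") for t in timestamps]
    return [round((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]


gaps = gaps_seconds(stamps)
print(gaps)       # [1, 2, 4, 8, 16, 32, 64]
print(sum(gaps))  # 127
```

So the master spent about two minutes retrying on a session that had already expired. An expired ZooKeeper session cannot be resumed, so these retries could not succeed; the abort itself was already logged at 02:02:10.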

