At the same time, the ZooKeeper log shows:
2017-03-23 02:01:33,004 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x35af577e0ac0000, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
2017-03-23 02:01:35,482 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /192.168.80.51:44456 which had sessionid 0x35af577e0ac0000
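
To check whether the ZooKeeper entries above belong to the master's expired session, the session ids can be extracted from both logs and compared. A minimal Python sketch, assuming the log excerpts are available as plain strings; the function name `session_ids` and the two sample strings are illustrative only and are not part of HBase or ZooKeeper:

```python
import re

# Matches hex tokens such as 0x35af577e0ac0000. Note that this pattern would
# also match heap addresses in a GC log, so feed it only ZooKeeper-related lines.
SESSION_ID = re.compile(r"0x[0-9a-fA-F]+")

def session_ids(log_text):
    """Return the set of hex session ids found in a log excerpt."""
    return set(SESSION_ID.findall(log_text))

# Sample lines copied from the two logs in this thread.
zk_log = ("Closed socket connection for client /192.168.80.51:44456 "
          "which had sessionid 0x35af577e0ac0000")
master_log = ("Unable to reconnect to ZooKeeper service, session "
              "0x15adbb9b9db078a has expired, closing socket connection")

common = session_ids(zk_log) & session_ids(master_log)
print(common or "no session id in common")
```

For these two excerpts the ids differ, so the ZooKeeper entries above refer to another session from the same host (192.168.80.51), not to the master session 0x15adbb9b9db078a that expired.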
Margus (margusja) Roo
http://margus.roo.ee
skype: margusja
https://www.facebook.com/allan.tuuring
+372 51 48 780
On 23/03/2017 08:43, Ted Yu wrote:
> Have you checked the ZooKeeper logs for any clues?
>
> Cheers
>
>> On Mar 22, 2017, at 11:30 PM, Margus Roo <margus@roo.ee> wrote:
>>
>> Hi
>>
>> Almost every night the HBase master shuts down. In the logs I can see:
>> gc.log:
>> 2017-03-23T01:59:27.239+0200: 41752.366: [GC (Allocation Failure) 2017-03-23T01:59:27.239+0200: 41752.366: [ParNew: 159203K->11611K(166464K), 0.0115189 secs] 177260K->29669K(536512K), 0.0117362 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
>> Heap
>> par new generation total 166464K, used 137930K [0x00000000c0000000, 0x00000000cb4a0000, 0x00000000d5550000)
>> eden space 147968K, 85% used [0x00000000c0000000, 0x00000000c7b5b8b8, 0x00000000c9080000)
>> from space 18496K, 62% used [0x00000000ca290000, 0x00000000cade6fa8, 0x00000000cb4a0000)
>> to space 18496K, 0% used [0x00000000c9080000, 0x00000000c9080000, 0x00000000ca290000)
>> concurrent mark-sweep generation total 370048K, used 18057K [0x00000000d5550000, 0x00000000ebeb0000, 0x0000000100000000)
>> Metaspace used 55061K, capacity 56096K, committed 56400K, reserved 1099776K
>> class space used 5899K, capacity 6255K, committed 6264K, reserved 1048576K
>>
>>
>>
>>
>> In master.log
>> 2017-03-23 02:02:09,178 WARN [master/nn3/192.168.80.51:16000-EventThread] client.ConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it
>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:585)
>> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
>> at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
>> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
>> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>> 2017-03-23 02:02:10,579 FATAL [main-EventThread] master.HMaster: Master server abort: loaded coprocessors are: [org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor, org.apache.hadoop.hbase.backup.master.BackupController, org.apache.hadoop.hbase.security.visibility.VisibilityController]
>> 2017-03-23 02:02:10,857 FATAL [main-EventThread] master.HMaster: master:16000-0x15adbb9b9db078a, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, baseZNode=/hbase-unsecure master:16000-0x15adbb9b9db078a received expired from ZooKeeper, aborting
>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:585)
>> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
>> at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
>> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
>> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>> 2017-03-23 02:02:10,090 INFO [main-SendThread(nn3:2181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x15adbb9b9db078a has expired, closing socket connection
>> 2017-03-23 02:02:09,181 WARN [nn3:16000.activeMasterManager-EventThread] client.ConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it
>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:585)
>> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
>> at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
>> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
>> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>> 2017-03-23 02:02:10,894 INFO [nn3:16000.activeMasterManager-EventThread] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x25adbb9ba62075d
>> 2017-03-23 02:02:10,894 INFO [nn3:16000.activeMasterManager-EventThread] zookeeper.ClientCnxn: EventThread shut down
>> 2017-03-23 02:02:10,876 INFO [master/nn3/192.168.80.51:16000-EventThread] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x25adbb9ba62075c
>> 2017-03-23 02:02:10,897 INFO [master/nn3/192.168.80.51:16000-EventThread] zookeeper.ClientCnxn: EventThread shut down
>> 2017-03-23 02:02:10,925 INFO [main-EventThread] regionserver.HRegionServer: STOPPED: master:16000-0x15adbb9b9db078a, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, baseZNode=/hbase-unsecure master:16000-0x15adbb9b9db078a received expired from ZooKeeper, aborting
>> 2017-03-23 02:02:10,935 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
>> 2017-03-23 02:02:11,005 INFO [master/nn3/192.168.80.51:16000] regionserver.HRegionServer: Stopping infoServer
>> 2017-03-23 02:02:11,624 INFO [nn3,16000,1490185417271_splitLogManager__ChoreService_1] master.SplitLogManager$TimeoutMonitor: Chore: SplitLogManager Timeout Monitor was stopped
>> 2017-03-23 02:02:11,628 WARN [nn3,16000,1490185417271_ChoreService_1] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/backup-masters
>> 2017-03-23 02:02:12,104 INFO [master/nn3/192.168.80.51:16000] mortbay.log: Stopped SelectChannelConnector@0.0.0.0:16010
>> 2017-03-23 02:02:12,286 INFO [master/nn3/192.168.80.51:16000] procedure2.ProcedureExecutor: Stopping the procedure executor
>> 2017-03-23 02:02:12,336 INFO [master/nn3/192.168.80.51:16000] wal.WALProcedureStore: Stopping the WAL Procedure Store
>> 2017-03-23 02:02:13,044 WARN [nn3,16000,1490185417271_ChoreService_1] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/backup-masters
>> 2017-03-23 02:02:14,497 INFO [master/nn3/192.168.80.51:16000] regionserver.HRegionServer: stopping server nn3,16000,1490185417271
>> 2017-03-23 02:02:14,514 INFO [master/nn3/192.168.80.51:16000] regionserver.HRegionServer: stopping server nn3,16000,1490185417271; all regions closed.
>> 2017-03-23 02:02:14,532 INFO [master/nn3/192.168.80.51:16000] hbase.ChoreService: Chore service for: nn3,16000,1490185417271 had [[ScheduledChore: Name: CatalogJanitor-nn3:16000 Period: 300000 Unit: MILLISECONDS], [ScheduledChore: Name: LogsCleaner Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: nn3,16000,1490185417271-ExpiredMobFileCleanerChore Period: 86400 Unit: SECONDS], [ScheduledChore: Name: nn3,16000,1490185417271-MobCompactionChore Period: 604800 Unit: SECONDS], [ScheduledChore: Name: nn3,16000,1490185417271-ClusterStatusChore Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: nn3,16000,1490185417271-BalancerChore Period: 300000 Unit: MILLISECONDS], [ScheduledChore: Name: HFileCleaner Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: nn3,16000,1490185417271-RegionNormalizerChore Period: 1800000 Unit: MILLISECONDS]] on shutdown
>> 2017-03-23 02:02:14,630 INFO [master/nn3/192.168.80.51:16000] master.MasterMobCompactionThread: Waiting for Mob Compaction Thread to finish...
>> 2017-03-23 02:02:14,644 INFO [master/nn3/192.168.80.51:16000] master.MasterMobCompactionThread: Waiting for Region Server Mob Compaction Thread to finish...
>> 2017-03-23 02:02:14,671 WARN [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> 2017-03-23 02:02:15,684 WARN [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> 2017-03-23 02:02:17,684 WARN [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> 2017-03-23 02:02:21,685 WARN [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> 2017-03-23 02:02:29,685 WARN [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> 2017-03-23 02:02:45,686 WARN [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> 2017-03-23 02:03:17,686 WARN [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> 2017-03-23 02:04:21,686 WARN [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> 2017-03-23 02:04:21,687 ERROR [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 7 attempts
>> 2017-03-23 02:04:21,687 WARN [master/nn3/192.168.80.51:16000] zookeeper.ZKUtil: master:16000-0x15adbb9b9db078a, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, baseZNode=/hbase-unsecure Unable to get data of znode /hbase-unsecure/master
>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>> ...
>>
>>
>>
>>
>>
>> hbase-site.xml:
>> <configuration>
>>
>> <property>
>> <name>dfs.client.read.shortcircuit</name>
>> <value>true</value>
>> </property>
>>
>> <property>
>> <name>dfs.domain.socket.path</name>
>> <value>/var/lib/hadoop-hdfs/dn_socket</value>
>> </property>
>>
>> <property>
>> <name>hbase.bulkload.staging.dir</name>
>> <value>/apps/hbase/staging</value>
>> </property>
>>
>> <property>
>> <name>hbase.client.keyvalue.maxsize</name>
>> <value>1048576</value>
>> </property>
>>
>> <property>
>> <name>hbase.client.retries.number</name>
>> <value>35</value>
>> </property>
>>
>> <property>
>> <name>hbase.client.scanner.caching</name>
>> <value>100</value>
>> </property>
>>
>> <property>
>> <name>hbase.client.scanner.timeout.period</name>
>> <value>600000</value>
>> </property>
>>
>> <property>
>> <name>hbase.cluster.distributed</name>
>> <value>true</value>
>> </property>
>>
>> <property>
>> <name>hbase.coprocessor.master.classes</name>
>> <value>org.apache.hadoop.hbase.security.visibility.VisibilityController,org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor</value>
>> </property>
>>
>> <property>
>> <name>hbase.coprocessor.region.classes</name>
>> <value>org.apache.hadoop.hbase.security.visibility.VisibilityController,org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor</value>
>> </property>
>>
>> <property>
>> <name>hbase.coprocessor.regionserver.classes</name>
>> <value>org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor</value>
>> </property>
>> <property>
>> <name>hbase.hregion.majorcompaction</name>
>> <value>604800000</value>
>> </property>
>>
>> <property>
>> <name>hbase.hregion.majorcompaction.jitter</name>
>> <value>0.50</value>
>> </property>
>>
>> <property>
>> <name>hbase.hregion.max.filesize</name>
>> <value>10737418240</value>
>> </property>
>>
>> <property>
>> <name>hbase.hregion.memstore.block.multiplier</name>
>> <value>4</value>
>> </property>
>>
>> <property>
>> <name>hbase.hregion.memstore.flush.size</name>
>> <value>134217728</value>
>> </property>
>>
>> <property>
>> <name>hbase.hregion.memstore.mslab.enabled</name>
>> <value>true</value>
>> </property>
>>
>> <property>
>> <name>hbase.hstore.blockingStoreFiles</name>
>> <value>10</value>
>> </property>
>>
>> <property>
>> <name>hbase.hstore.compaction.max</name>
>> <value>10</value>
>> </property>
>>
>> <property>
>> <name>hbase.hstore.compactionThreshold</name>
>> <value>3</value>
>> </property>
>>
>> <property>
>> <name>hbase.local.dir</name>
>> <value>${hbase.tmp.dir}/local</value>
>> </property>
>> <property>
>> <name>hbase.master.info.bindAddress</name>
>> <value>0.0.0.0</value>
>> </property>
>>
>> <property>
>> <name>hbase.master.info.port</name>
>> <value>16010</value>
>> </property>
>>
>> <property>
>> <name>hbase.master.loadbalance.bytable</name>
>> <value>true</value>
>> </property>
>>
>> <property>
>> <name>hbase.master.port</name>
>> <value>16000</value>
>> </property>
>>
>> <property>
>> <name>hbase.master.ui.readonly</name>
>> <value>false</value>
>> </property>
>>
>> <property>
>> <name>hbase.regionserver.global.memstore.size</name>
>> <value>0.4</value>
>> </property>
>>
>> <property>
>> <name>hbase.regionserver.handler.count</name>
>> <value>30</value>
>> </property>
>>
>> <property>
>> <name>hbase.regionserver.info.port</name>
>> <value>16030</value>
>> </property>
>>
>> <property>
>> <name>hbase.regionserver.port</name>
>> <value>16020</value>
>> </property>
>>
>> <property>
>> <name>hbase.regionserver.wal.codec</name>
>> <value>org.apache.hadoop.hbase.regionserver.wal.WALCellCodec</value>
>> </property>
>>
>> <property>
>> <name>hbase.rootdir</name>
>> <value>hdfs://nn3:8020/apps/hbase/data</value>
>> </property>
>>
>> <property>
>> <name>hbase.rpc.protection</name>
>> <value>authentication</value>
>> </property>
>>
>> <property>
>> <name>hbase.rpc.timeout</name>
>> <value>90000</value>
>> </property>
>>
>> <property>
>> <name>hbase.security.authentication</name>
>> <value>simple</value>
>> </property>
>>
>> <property>
>> <name>hbase.security.authorization</name>
>> <value>true</value>
>> </property>
>>
>> <property>
>> <name>hbase.superuser</name>
>> <value>hbase</value>
>> </property>
>>
>> <property>
>> <name>hbase.tmp.dir</name>
>> <value>/tmp/hbase-${user.name}</value>
>> </property>
>>
>> <property>
>> <name>hbase.zookeeper.property.clientPort</name>
>> <value>2181</value>
>> </property>
>>
>> <property>
>> <name>hbase.zookeeper.quorum</name>
>> <value>bigdata33,bigdata36,nn3</value>
>> </property>
>>
>> <property>
>> <name>hbase.zookeeper.useMulti</name>
>> <value>true</value>
>> </property>
>>
>> <property>
>> <name>hfile.block.cache.size</name>
>> <value>0.4</value>
>> </property>
>>
>> <property>
>> <name>hfile.format.version</name>
>> <value>3</value>
>> </property>
>>
>> <property>
>> <name>phoenix.query.timeoutMs</name>
>> <value>60000</value>
>> </property>
>>
>> <property>
>> <name>replication.executor.workers</name>
>> <value>2</value>
>> </property>
>>
>> <property>
>> <name>replication.sleep.before.failover</name>
>> <value>60000</value>
>> </property>
>>
>> <property>
>> <name>zookeeper.recovery.retry</name>
>> <value>6</value>
>> </property>
>>
>> <property>
>> <name>zookeeper.session.timeout</name>
>> <value>90000</value>
>> </property>
>>
>> <property>
>> <name>zookeeper.znode.parent</name>
>> <value>/hbase-unsecure</value>
>> </property>
>>
>> <property>
>> <name>zookeeper.znode.replication</name>
>> <value>replication</value>
>> </property>
>>
>> <property>
>> <name>zookeeper.znode.replication.peers</name>
>> <value>peers</value>
>> </property>
>>
>> <property>
>> <name>zookeeper.znode.replication.peers.state</name>
>> <value>peer-state</value>
>> </property>
>>
>> <property>
>> <name>zookeeper.znode.replication.rs</name>
>> <value>rs</value>
>> </property>
>>
>> </configuration>
>>
>> Any hints?