hbase-user mailing list archives

From Margus Roo <mar...@roo.ee>
Subject HBase master dies (1.1.2) often
Date Thu, 23 Mar 2017 06:30:14 GMT
Hi

Almost every night the HBase master shuts down. In the logs I can see:
gc.log:
2017-03-23T01:59:27.239+0200: 41752.366: [GC (Allocation Failure) 
2017-03-23T01:59:27.239+0200: 41752.366: [ParNew: 
159203K->11611K(166464K), 0.0115189 secs] 177260K->29669K(536512K), 
0.0117362 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
Heap
  par new generation   total 166464K, used 137930K [0x00000000c0000000, 
0x00000000cb4a0000, 0x00000000d5550000)
   eden space 147968K,  85% used [0x00000000c0000000, 
0x00000000c7b5b8b8, 0x00000000c9080000)
   from space 18496K,  62% used [0x00000000ca290000, 0x00000000cade6fa8, 
0x00000000cb4a0000)
   to   space 18496K,   0% used [0x00000000c9080000, 0x00000000c9080000, 
0x00000000ca290000)
  concurrent mark-sweep generation total 370048K, used 18057K 
[0x00000000d5550000, 0x00000000ebeb0000, 0x0000000100000000)
  Metaspace       used 55061K, capacity 56096K, committed 56400K, 
reserved 1099776K
   class space    used 5899K, capacity 6255K, committed 6264K, reserved 
1048576K
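
For scale, here is a minimal sketch that pulls the total pause out of the ParNew line above and compares it with the zookeeper.session.timeout of 90000 ms from the hbase-site.xml further down. The regular expression and the names gc_line, total_pause_ms and SESSION_TIMEOUT_MS are illustrative assumptions, and it only looks at this one logged collection, so it says nothing about pauses that were not captured here.

```python
import re

# The ParNew entry from gc.log above, joined onto one line.
gc_line = (
    "2017-03-23T01:59:27.239+0200: 41752.366: [ParNew: "
    "159203K->11611K(166464K), 0.0115189 secs] 177260K->29669K(536512K), "
    "0.0117362 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]"
)

# zookeeper.session.timeout from hbase-site.xml, in milliseconds.
SESSION_TIMEOUT_MS = 90000


def total_pause_ms(line):
    """Return the whole-collection pause (the value just before [Times:) in ms."""
    match = re.search(r"(\d+\.\d+) secs\] \[Times", line)
    if match is None:
        raise ValueError("no total pause found in GC line")
    return float(match.group(1)) * 1000.0


pause_ms = total_pause_ms(gc_line)
print(f"pause: {pause_ms:.1f} ms, session timeout: {SESSION_TIMEOUT_MS} ms")
print("pause exceeds timeout:", pause_ms > SESSION_TIMEOUT_MS)
```

On this line the pause comes out at roughly 11.7 ms, far below the 90 s session timeout.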




In master.log
2017-03-23 02:02:09,178 WARN 
[master/nn3/192.168.80.51:16000-EventThread] 
client.ConnectionManager$HConnectionImplementation: This client just 
lost it's session with ZooKeeper, closing it. It will be recreated next 
time someone needs it
org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired
         at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:585)
         at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
         at 
org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
         at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
         at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
2017-03-23 02:02:10,579 FATAL [main-EventThread] master.HMaster: Master 
server abort: loaded coprocessors are: 
[org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor, 
org.apache.hadoop.hbase.backup.master.BackupController, 
org.apache.hadoop.hbase.security.visibility.VisibilityController]
2017-03-23 02:02:10,857 FATAL [main-EventThread] master.HMaster: 
master:16000-0x15adbb9b9db078a, 
quorum=bigdata33:2181,bigdata36:2181,nn3:2181, baseZNode=/hbase-unsecure 
master:16000-0x15adbb9b9db078a received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired
         at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:585)
         at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
         at 
org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
         at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
         at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
2017-03-23 02:02:10,090 INFO  [main-SendThread(nn3:2181)] 
zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 
0x15adbb9b9db078a has expired, closing socket connection
2017-03-23 02:02:09,181 WARN [nn3:16000.activeMasterManager-EventThread] 
client.ConnectionManager$HConnectionImplementation: This client just 
lost it's session with ZooKeeper, closing it. It will be recreated next 
time someone needs it
org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired
         at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:585)
         at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
         at 
org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
         at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
         at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
2017-03-23 02:02:10,894 INFO [nn3:16000.activeMasterManager-EventThread] 
client.ConnectionManager$HConnectionImplementation: Closing zookeeper 
sessionid=0x25adbb9ba62075d
2017-03-23 02:02:10,894 INFO [nn3:16000.activeMasterManager-EventThread] 
zookeeper.ClientCnxn: EventThread shut down
2017-03-23 02:02:10,876 INFO 
[master/nn3/192.168.80.51:16000-EventThread] 
client.ConnectionManager$HConnectionImplementation: Closing zookeeper 
sessionid=0x25adbb9ba62075c
2017-03-23 02:02:10,897 INFO 
[master/nn3/192.168.80.51:16000-EventThread] zookeeper.ClientCnxn: 
EventThread shut down
2017-03-23 02:02:10,925 INFO  [main-EventThread] 
regionserver.HRegionServer: STOPPED: master:16000-0x15adbb9b9db078a, 
quorum=bigdata33:2181,bigdata36:2181,nn3:2181, baseZNode=/hbase-unsecure 
master:16000-0x15adbb9b9db078a received expired from ZooKeeper, aborting
2017-03-23 02:02:10,935 INFO  [main-EventThread] zookeeper.ClientCnxn: 
EventThread shut down
2017-03-23 02:02:11,005 INFO  [master/nn3/192.168.80.51:16000] 
regionserver.HRegionServer: Stopping infoServer
2017-03-23 02:02:11,624 INFO 
[nn3,16000,1490185417271_splitLogManager__ChoreService_1] 
master.SplitLogManager$TimeoutMonitor: Chore: SplitLogManager Timeout 
Monitor was stopped
2017-03-23 02:02:11,628 WARN [nn3,16000,1490185417271_ChoreService_1] 
zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, 
quorum=bigdata33:2181,bigdata36:2181,nn3:2181, 
exception=org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /hbase-unsecure/backup-masters
2017-03-23 02:02:12,104 INFO  [master/nn3/192.168.80.51:16000] 
mortbay.log: Stopped SelectChannelConnector@0.0.0.0:16010
2017-03-23 02:02:12,286 INFO  [master/nn3/192.168.80.51:16000] 
procedure2.ProcedureExecutor: Stopping the procedure executor
2017-03-23 02:02:12,336 INFO  [master/nn3/192.168.80.51:16000] 
wal.WALProcedureStore: Stopping the WAL Procedure Store
2017-03-23 02:02:13,044 WARN [nn3,16000,1490185417271_ChoreService_1] 
zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, 
quorum=bigdata33:2181,bigdata36:2181,nn3:2181, 
exception=org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /hbase-unsecure/backup-masters
2017-03-23 02:02:14,497 INFO  [master/nn3/192.168.80.51:16000] 
regionserver.HRegionServer: stopping server nn3,16000,1490185417271
2017-03-23 02:02:14,514 INFO  [master/nn3/192.168.80.51:16000] 
regionserver.HRegionServer: stopping server nn3,16000,1490185417271; all 
regions closed.
2017-03-23 02:02:14,532 INFO  [master/nn3/192.168.80.51:16000] 
hbase.ChoreService: Chore service for: nn3,16000,1490185417271 had 
[[ScheduledChore: Name: CatalogJanitor-nn3:16000 Period: 300000 Unit: 
MILLISECONDS], [ScheduledChore: Name: LogsCleaner Period: 60000 Unit: 
MILLISECONDS], [ScheduledChore: Name: 
nn3,16000,1490185417271-ExpiredMobFileCleanerChore Period: 86400 Unit: 
SECONDS], [ScheduledChore: Name: 
nn3,16000,1490185417271-MobCompactionChore Period: 604800 Unit: 
SECONDS], [ScheduledChore: Name: 
nn3,16000,1490185417271-ClusterStatusChore Period: 60000 Unit: 
MILLISECONDS], [ScheduledChore: Name: 
nn3,16000,1490185417271-BalancerChore Period: 300000 Unit: 
MILLISECONDS], [ScheduledChore: Name: HFileCleaner Period: 60000 Unit: 
MILLISECONDS], [ScheduledChore: Name: 
nn3,16000,1490185417271-RegionNormalizerChore Period: 1800000 Unit: 
MILLISECONDS]] on shutdown
2017-03-23 02:02:14,630 INFO  [master/nn3/192.168.80.51:16000] 
master.MasterMobCompactionThread: Waiting for Mob Compaction Thread to 
finish...
2017-03-23 02:02:14,644 INFO  [master/nn3/192.168.80.51:16000] 
master.MasterMobCompactionThread: Waiting for Region Server Mob 
Compaction Thread to finish...
2017-03-23 02:02:14,671 WARN  [master/nn3/192.168.80.51:16000] 
zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, 
quorum=bigdata33:2181,bigdata36:2181,nn3:2181, 
exception=org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /hbase-unsecure/master
2017-03-23 02:02:15,684 WARN  [master/nn3/192.168.80.51:16000] 
zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, 
quorum=bigdata33:2181,bigdata36:2181,nn3:2181, 
exception=org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /hbase-unsecure/master
2017-03-23 02:02:17,684 WARN  [master/nn3/192.168.80.51:16000] 
zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, 
quorum=bigdata33:2181,bigdata36:2181,nn3:2181, 
exception=org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /hbase-unsecure/master
2017-03-23 02:02:21,685 WARN  [master/nn3/192.168.80.51:16000] 
zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, 
quorum=bigdata33:2181,bigdata36:2181,nn3:2181, 
exception=org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /hbase-unsecure/master
2017-03-23 02:02:29,685 WARN  [master/nn3/192.168.80.51:16000] 
zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, 
quorum=bigdata33:2181,bigdata36:2181,nn3:2181, 
exception=org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /hbase-unsecure/master
2017-03-23 02:02:45,686 WARN  [master/nn3/192.168.80.51:16000] 
zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, 
quorum=bigdata33:2181,bigdata36:2181,nn3:2181, 
exception=org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /hbase-unsecure/master
2017-03-23 02:03:17,686 WARN  [master/nn3/192.168.80.51:16000] 
zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, 
quorum=bigdata33:2181,bigdata36:2181,nn3:2181, 
exception=org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /hbase-unsecure/master
2017-03-23 02:04:21,686 WARN  [master/nn3/192.168.80.51:16000] 
zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, 
quorum=bigdata33:2181,bigdata36:2181,nn3:2181, 
exception=org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /hbase-unsecure/master
2017-03-23 02:04:21,687 ERROR [master/nn3/192.168.80.51:16000] 
zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 7 attempts
2017-03-23 02:04:21,687 WARN  [master/nn3/192.168.80.51:16000] 
zookeeper.ZKUtil: master:16000-0x15adbb9b9db078a, 
quorum=bigdata33:2181,bigdata36:2181,nn3:2181, baseZNode=/hbase-unsecure 
Unable to get data of znode /hbase-unsecure/master
org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /hbase-unsecure/master
...
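
The spacing of the "Session expired for /hbase-unsecure/master" warnings above can be checked with a short sketch. The timestamps are copied from the log; the helper name gaps_in_seconds is hypothetical. The gaps roughly double each time, which looks like an exponential backoff between retries, though that reading is an assumption based only on these timestamps.

```python
from datetime import datetime

# Timestamps of the WARN lines for /hbase-unsecure/master, copied from master.log.
retry_warnings = [
    "2017-03-23 02:02:14,671",
    "2017-03-23 02:02:15,684",
    "2017-03-23 02:02:17,684",
    "2017-03-23 02:02:21,685",
    "2017-03-23 02:02:29,685",
    "2017-03-23 02:02:45,686",
    "2017-03-23 02:03:17,686",
    "2017-03-23 02:04:21,686",
]


def gaps_in_seconds(stamps):
    """Return the time difference between consecutive log timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S,%f"
    times = [datetime.strptime(s, fmt) for s in stamps]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


gaps = gaps_in_seconds(retry_warnings)
print([round(g) for g in gaps])  # prints [1, 2, 4, 8, 16, 32, 64]
```

The waits add up to about 127 seconds between the first warning and the final "getData failed after 7 attempts" error.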





hbase-site.xml:
   <configuration>

     <property>
       <name>dfs.client.read.shortcircuit</name>
       <value>true</value>
     </property>

     <property>
       <name>dfs.domain.socket.path</name>
       <value>/var/lib/hadoop-hdfs/dn_socket</value>
     </property>

     <property>
       <name>hbase.bulkload.staging.dir</name>
       <value>/apps/hbase/staging</value>
     </property>

     <property>
       <name>hbase.client.keyvalue.maxsize</name>
       <value>1048576</value>
     </property>

     <property>
       <name>hbase.client.retries.number</name>
       <value>35</value>
     </property>

     <property>
       <name>hbase.client.scanner.caching</name>
       <value>100</value>
     </property>

     <property>
       <name>hbase.client.scanner.timeout.period</name>
       <value>600000</value>
     </property>

     <property>
       <name>hbase.cluster.distributed</name>
       <value>true</value>
     </property>

     <property>
       <name>hbase.coprocessor.master.classes</name>
       <value>org.apache.hadoop.hbase.security.visibility.VisibilityController,org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor</value>
     </property>

     <property>
       <name>hbase.coprocessor.region.classes</name>
       <value>org.apache.hadoop.hbase.security.visibility.VisibilityController,org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor</value>
     </property>

     <property>
       <name>hbase.coprocessor.regionserver.classes</name>
       <value>org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor</value>
     </property>
     <property>
       <name>hbase.hregion.majorcompaction</name>
       <value>604800000</value>
     </property>

     <property>
       <name>hbase.hregion.majorcompaction.jitter</name>
       <value>0.50</value>
     </property>

     <property>
       <name>hbase.hregion.max.filesize</name>
       <value>10737418240</value>
     </property>

     <property>
       <name>hbase.hregion.memstore.block.multiplier</name>
       <value>4</value>
     </property>

     <property>
       <name>hbase.hregion.memstore.flush.size</name>
       <value>134217728</value>
     </property>

     <property>
       <name>hbase.hregion.memstore.mslab.enabled</name>
       <value>true</value>
     </property>

     <property>
       <name>hbase.hstore.blockingStoreFiles</name>
       <value>10</value>
     </property>

     <property>
       <name>hbase.hstore.compaction.max</name>
       <value>10</value>
     </property>

     <property>
       <name>hbase.hstore.compactionThreshold</name>
       <value>3</value>
     </property>

     <property>
       <name>hbase.local.dir</name>
       <value>${hbase.tmp.dir}/local</value>
     </property>
     <property>
       <name>hbase.master.info.bindAddress</name>
       <value>0.0.0.0</value>
     </property>

     <property>
       <name>hbase.master.info.port</name>
       <value>16010</value>
     </property>

     <property>
       <name>hbase.master.loadbalance.bytable</name>
       <value>true</value>
     </property>

     <property>
       <name>hbase.master.port</name>
       <value>16000</value>
     </property>

     <property>
       <name>hbase.master.ui.readonly</name>
       <value>false</value>
     </property>

     <property>
       <name>hbase.regionserver.global.memstore.size</name>
       <value>0.4</value>
     </property>

     <property>
       <name>hbase.regionserver.handler.count</name>
       <value>30</value>
     </property>

     <property>
       <name>hbase.regionserver.info.port</name>
       <value>16030</value>
     </property>

     <property>
       <name>hbase.regionserver.port</name>
       <value>16020</value>
     </property>

     <property>
       <name>hbase.regionserver.wal.codec</name>
       <value>org.apache.hadoop.hbase.regionserver.wal.WALCellCodec</value>
     </property>

     <property>
       <name>hbase.rootdir</name>
       <value>hdfs://nn3:8020/apps/hbase/data</value>
     </property>

     <property>
       <name>hbase.rpc.protection</name>
       <value>authentication</value>
     </property>

     <property>
       <name>hbase.rpc.timeout</name>
       <value>90000</value>
     </property>

     <property>
       <name>hbase.security.authentication</name>
       <value>simple</value>
     </property>

     <property>
       <name>hbase.security.authorization</name>
       <value>true</value>
     </property>

     <property>
       <name>hbase.superuser</name>
       <value>hbase</value>
     </property>

     <property>
       <name>hbase.tmp.dir</name>
       <value>/tmp/hbase-${user.name}</value>
     </property>

     <property>
       <name>hbase.zookeeper.property.clientPort</name>
       <value>2181</value>
     </property>

     <property>
       <name>hbase.zookeeper.quorum</name>
       <value>bigdata33,bigdata36,nn3</value>
     </property>

     <property>
       <name>hbase.zookeeper.useMulti</name>
       <value>true</value>
     </property>

     <property>
       <name>hfile.block.cache.size</name>
       <value>0.4</value>
     </property>

     <property>
       <name>hfile.format.version</name>
       <value>3</value>
     </property>

     <property>
       <name>phoenix.query.timeoutMs</name>
       <value>60000</value>
     </property>

     <property>
       <name>replication.executor.workers</name>
       <value>2</value>
     </property>

     <property>
       <name>replication.sleep.before.failover</name>
       <value>60000</value>
     </property>

     <property>
       <name>zookeeper.recovery.retry</name>
       <value>6</value>
     </property>

     <property>
       <name>zookeeper.session.timeout</name>
       <value>90000</value>
     </property>

     <property>
       <name>zookeeper.znode.parent</name>
       <value>/hbase-unsecure</value>
     </property>

     <property>
       <name>zookeeper.znode.replication</name>
       <value>replication</value>
     </property>

     <property>
       <name>zookeeper.znode.replication.peers</name>
       <value>peers</value>
     </property>

     <property>
       <name>zookeeper.znode.replication.peers.state</name>
       <value>peer-state</value>
     </property>

     <property>
       <name>zookeeper.znode.replication.rs</name>
       <value>rs</value>
     </property>

   </configuration>
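
To pull just the ZooKeeper-related settings out of a file like the one above, something along these lines should work. It is a sketch that runs on an inline three-property excerpt of the config rather than the real file, and load_properties is a hypothetical helper name, not part of HBase.

```python
import xml.etree.ElementTree as ET

# Three properties excerpted from the hbase-site.xml above.
excerpt = """
<configuration>
  <property>
    <name>zookeeper.recovery.retry</name>
    <value>6</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>90000</value>
  </property>
  <property>
    <name>zookeeper.znode.parent</name>
    <value>/hbase-unsecure</value>
  </property>
</configuration>
"""


def load_properties(xml_text):
    """Map each <name> to its <value> in a Hadoop-style configuration document."""
    root = ET.fromstring(xml_text.strip())
    return {prop.findtext("name"): prop.findtext("value")
            for prop in root.iter("property")}


props = load_properties(excerpt)
for name in sorted(props):
    if name.startswith("zookeeper."):
        print(f"{name} = {props[name]}")
```

For the full file, the same helper could be fed the file contents read from disk instead of the excerpt string.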

Any hints?

-- 
Margus (margusja) Roo
http://margus.roo.ee
skype: margusja
https://www.facebook.com/allan.tuuring
+372 51 48 780

