hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: RegionServer crashed with ZK session expired
Date Thu, 09 Dec 2010 17:44:19 GMT
Hey Alex,

It's either the client or the server box that wasn't responding,
either way they didn't talk for 52 seconds when your session timeout
was set at 40 seconds. I suggest you also take a look at the ZK log,
it should tell you exactly when the session was expired.

J-D

On Thu, Dec 9, 2010 at 9:19 AM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
> Hello guys,
>
> I got regionserver crash and trying to find out why. I found
> * in regionserver log [1] : ZK session expired, and before that slow hlog
> edits;
> * nothing in DataNode log [2] and HMaster[4] log;
> * some warns in ZK log [3] with EndOfStreamException.
>
> I wonder can it be because of long GC and thus "stop the world" pause for
> about a minute (seen in logs)? Unfortunately GC wasn't logged (this is an
> old test box). If it can be caused by GC I'd restart the server with GC
> logging, but if not, I'd like to figure out the cause...
>
> HBase version is 0.89.20100924.
>
> Thank you,
>
> Alex Baranau
> ----
> Sematext :: http://sematext.com/
>
> [1] Regionserver log: ZK session expired, before that slow hlog edits
>
> 2010-12-08 06:43:15,403 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Starting compaction on region agg2,system|tq|s\x00|in licensed License
> project\x00|,1291790014280.75a3871bc968e0438e763cde7e313c2c.
> 2010-12-08 06:43:15,404 INFO org.apache.hadoop.hbase.regionserver.Store:
> Started compaction of 4 file(s) in a of agg2,system|tq|s\x00|in licensed
> License project\x00|,1291790014280.75a3871bc968e0438e763cde7e313c2c.,
> hasReferences=true, into
> hdfs://localhost/hbase/agg2/75a3871bc968e0438e763cde7e313c2c/.tmp,
> sequenceid=1153119174
> 2010-12-08 06:43:24,825 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> started; Attempting to free 39.93 MB of total=339.31 MB
> 2010-12-08 06:43:24,829 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> completed; freed=39.95 MB, total=299.36 MB, single=147.93 MB, multi=154.82
> MB, memory=33.28 MB
> 2010-12-08 06:43:50,871 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> started; Attempting to free 40 MB of total=339.39 MB
> 2010-12-08 06:43:50,875 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> completed; freed=40.01 MB, total=299.37 MB, single=147.86 MB, multi=154.96
> MB, memory=33.28 MB
> 2010-12-08 06:43:57,095 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:
> IPC Server handler 4 on 60020 took 5338ms appending an edit to hlog;
> editcount=8882
> 2010-12-08 06:44:25,086 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:
> IPC Server handler 24 on 60020 took 5658ms appending an edit to hlog;
> editcount=13054
> 2010-12-08 06:44:55,815 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:
> IPC Server handler 5 on 60020 took 5884ms appending an edit to hlog;
> editcount=13472
> 2010-12-08 06:45:33,003 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=337.5 MB,
> free=61.68 MB, max=399.18 MB, blocks=9015, accesses=553728227,
> hits=40478965, hitRatio=7.31%%, evictions=135536, evicted=86306453,
> evictedPerRun=636.77880859375
> 2010-12-08 06:45:45,333 INFO
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Forced flushing of
> raw3,,1291780273350.73d076a68191712c68d16c4805973561. because global
> memstore limit of 798.4m exceeded; currently 812.8m and flushing till 698.6m
> 2010-12-08 06:45:45,363 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> Started memstore flush for region
> raw3,,1291780273350.73d076a68191712c68d16c4805973561.. Current region
> memstore size 45.0m
> 2010-12-08 06:45:45,364 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> Finished snapshotting, commencing flushing stores
> 2010-12-08 06:45:58,107 INFO org.apache.hadoop.hbase.regionserver.Store:
> Renaming flushed file at
> hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/.tmp/6597521672231455831
> to
> hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/c/3157254097847060580
> 2010-12-08 06:45:58,164 INFO org.apache.hadoop.hbase.regionserver.Store:
> Added
> hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/c/3157254097847060580,
> entries=8356, sequenceid=1153215388, memsize=1.7m, filesize=73.2k to
> raw3,,1291780273350.73d076a68191712c68d16c4805973561.
> 2010-12-08 06:46:44,092 INFO org.apache.zookeeper.ClientCnxn: Client session
> timed out, have not heard from server in 52481ms for sessionid
> 0x12c34f1783c0002, closing socket connection and attempting reconnect
> 2010-12-08 06:46:44,092 INFO org.apache.zookeeper.ClientCnxn: Client session
> timed out, have not heard from server in 52481ms for sessionid
> 0x12c34f1783c00bf, closing socket connection and attempting reconnect
> 2010-12-08 06:46:44,200 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Disconnected, type: None, path: null
> 2010-12-08 06:46:44,200 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Disconnected, type: None, path: null
> 2010-12-08 06:46:44,201 WARN
> org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: No longer
> connected to ZooKeeper, current state: Disconnected
> 2010-12-08 06:46:51,218 INFO org.apache.zookeeper.ClientCnxn: Opening socket
> connection to server ip-10-245-202-207.ec2.internal/10.245.202.207:2181
> 2010-12-08 06:46:51,219 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to ip-10-245-202-207.ec2.internal/10.245.202.207:2181,
> initiating session
> 2010-12-08 06:46:51,220 INFO org.apache.zookeeper.ClientCnxn: Session
> establishment complete on server ip-10-245-202-207.ec2.internal/
> 10.245.202.207:2181, sessionid = 0x12c34f1783c00bf, negotiated timeout =
> 40000
> 2010-12-08 06:46:51,264 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x12c34f1783c00bf closed
> 2010-12-08 06:46:51,264 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper:
> <localhost:/hbase,org.apache.hadoop.hbase.client.HConnectionManager>Closed
> connection with ZooKeeper; /hbase/root-region-server
> 2010-12-08 06:46:58,375 INFO org.apache.zookeeper.ClientCnxn: Opening socket
> connection to server ip-10-245-202-207.ec2.internal/10.245.202.207:2181
> 2010-12-08 06:46:58,376 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to ip-10-245-202-207.ec2.internal/10.245.202.207:2181,
> initiating session
> 2010-12-08 06:46:58,490 INFO org.apache.zookeeper.ClientCnxn: Unable to
> reconnect to ZooKeeper service, session 0x12c34f1783c0002 has expired,
> closing socket connection
> 2010-12-08 06:46:58,491 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Expired, type: None, path: null
> 2010-12-08 06:46:58,491 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
> expired
> 2010-12-08 06:46:58,491 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region server
> serverName=ip-10-245-202-207.ec2.internal,60020,1289379196235,
> load=(requests=73, regions=649, usedHeap=1983, maxHeap=1995): ZooKeeper
> session expired
>
>
> [2] DataNode log: nothing, just noticeable pause for almost 1 minute
>
> 2010-12-08 06:45:58,120 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 127.0.0.1:50010, dest: /127.0.0.1:39194, bytes: 200, op: HDFS_READ, cliID:
> DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
> offset: 74752, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
> blockid: blk_6449721912917715607_913675, duration: 3639142
> 2010-12-08 06:45:58,134 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 127.0.0.1:50010, dest: /127.0.0.1:39195, bytes: 716, op: HDFS_READ, cliID:
> DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
> offset: 74240, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
> blockid: blk_6449721912917715607_913675, duration: 427441
> 2010-12-08 06:45:58,148 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 127.0.0.1:50010, dest: /127.0.0.1:39193, bytes: 1014972, op: HDFS_READ,
> cliID:
> DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
> offset: 66101760, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
> blockid: blk_825672696763040280_900371, duration: 6352102264
> 2010-12-08 06:46:44,192 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_-8677121983399284148_913676 src: /127.0.0.1:39196 dest: /127.0.0.1:50010
> 2010-12-08 06:46:51,263 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 127.0.0.1:50010, dest: /127.0.0.1:53722, bytes: 264192, op: HDFS_READ,
> cliID:
> DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
> offset: 17161216, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
> blockid: blk_8449446729722678098_531164, duration: 401874426400
> 2010-12-08 06:46:51,333 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 127.0.0.1:50010, dest: /127.0.0.1:53737, bytes: 264192, op: HDFS_READ,
> cliID:
> DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
> offset: 38253056, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
> blockid: blk_-4815658898725325280_624191, duration: 401614939902
> 2010-12-08 06:46:58,266 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
> recoverBlock(block=blk_7609568532460932683_912614, targets=[127.0.0.1:50010
> ])
>
> [3] Zookeeper log: some warns
>
> 2010-12-08 02:50:46,100 - INFO  [ProcessThread:-1:PrepRequestProcessor@385]
> - Processed session termination for sessionid: 0x12c34f1783d6eba
> 2010-12-08 02:50:46,104 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for
> client /127.0.0.1:48900 which had sessionid 0x12c34f1783d6eba
> 2010-12-08 06:46:44,092 - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@633] - EndOfStreamException: Unable to
> read additional data from client sessionid 0x12c34f1783c00bf, likely client
> has closed socket
> 2010-12-08 06:46:44,661 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for
> client /10.245.202.207:53464 which had sessionid 0x12c34f1783c00bf
> 2010-12-08 06:46:44,662 - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@633] - EndOfStreamException: Unable to
> read additional data from client sessionid 0x12c34f1783c0002, likely client
> has closed socket
> 2010-12-08 06:46:44,662 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for
> client /10.245.202.207:37888 which had sessionid 0x12c34f1783c0002
> 2010-12-08 06:46:51,219 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@250] - Accepted socket connection
> from /10.245.202.207:33388
> 2010-12-08 06:46:51,219 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@769] - Client attempting to renew session
> 0x12c34f1783c00bf at /10.245.202.207:33388
> 2010-12-08 06:46:51,220 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1579] - Established session
> 0x12c34f1783c00bf with negotiated timeout 40000 for client /
> 10.245.202.207:33388
> 2010-12-08 06:46:51,220 - INFO  [ProcessThread:-1:PrepRequestProcessor@385]
> - Processed session termination for sessionid: 0x12c34f1783c00bf
> 2010-12-08 06:46:51,223 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for
> client /10.245.202.207:33388 which had sessionid 0x12c34f1783c00bf
> 2010-12-08 06:46:58,001 - INFO  [SessionTracker:ZooKeeperServer@315] -
> Expiring session 0x12c34f1783c0002, timeout of 40000ms exceeded
> 2010-12-08 06:46:58,002 - INFO  [ProcessThread:-1:PrepRequestProcessor@385]
> - Processed session termination for sessionid: 0x12c34f1783c0002
> 2010-12-08 06:46:58,376 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@250] - Accepted socket connection
> from /10.245.202.207:33394
> 2010-12-08 06:46:58,489 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@769] - Client attempting to renew session
> 0x12c34f1783c0002 at /10.245.202.207:33394
> 2010-12-08 06:46:58,489 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1572] - Invalid session 0x12c34f1783c0002
> for client /10.245.202.207:33394, probably expired
> 2010-12-08 06:46:58,490 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for
> client /10.245.202.207:33394 which had sessionid 0x12c34f1783c0002
>
> [4] HMaster log: nothing
>
> 2010-12-08 06:45:02,246 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.metaScanner scan of 648 row(s) of meta region {server:
> 10.245.202.207:60020, regionname: .META.,,1.1028785192, startKey: <>}
> complete
> 2010-12-08 06:45:02,246 INFO org.apache.hadoop.hbase.master.BaseScanner: All
> 1 .META. region(s) scanned
> 2010-12-08 06:45:25,277 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scanning meta region {server: 10.245.202.207:60020,
> regionname: -ROOT-,,0.70236052, startKey: <>}
> 2010-12-08 06:45:27,132 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scan of 1 row(s) of meta region {server:
> 10.245.202.207:60020, regionname: -ROOT-,,0.70236052, startKey: <>} complete
> 2010-12-08 06:45:35,367 INFO org.apache.hadoop.hbase.master.ServerManager: 1
> region servers, 0 dead, average load 649.0
> 2010-12-08 06:45:45,563 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.metaScanner scanning meta region {server: 10.245.202.207:60020,
> regionname: .META.,,1.1028785192, startKey: <>}
> 2010-12-08 06:46:25,244 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scanning meta region {server: 10.245.202.207:60020,
> regionname: -ROOT-,,0.70236052, startKey: <>}
> 2010-12-08 06:46:35,369 INFO org.apache.hadoop.hbase.master.ServerManager: 1
> region servers, 0 dead, average load 649.0
> 2010-12-08 06:46:44,154 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scan of 1 row(s) of meta region {server:
> 10.245.202.207:60020, regionname: -ROOT-,,0.70236052, startKey: <>} complete
> 2010-12-08 06:46:58,005 INFO org.apache.hadoop.hbase.master.ServerManager:
> ip-10-245-202-207.ec2.internal,60020,1289379196235 znode expired
> 2010-12-08 06:46:58,005 DEBUG org.apache.hadoop.hbase.master.ServerManager:
> Added=ip-10-245-202-207.ec2.internal,60020,1289379196235 to dead servers,
> added shutdown processing operation
> 2010-12-08 06:46:58,006 INFO org.apache.hadoop.hbase.master.RegionManager:
> -ROOT- region unset (but not set to be reassigned)
> 2010-12-08 06:46:58,006 INFO org.apache.hadoop.hbase.master.RegionManager:
> META region whose startkey is  removed from onlineMetaRegions
> 2010-12-08 06:46:58,006 DEBUG
> org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo:
> ProcessServerShutdown of ip-10-245-202-207.ec2.internal,60020,1289379196235
> 2010-12-08 06:46:58,006 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: Process shutdown of
> server ip-10-245-202-207.ec2.internal,60020,1289379196235: logSplit: false,
> rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 0
> 2010-12-08 06:46:58,012 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
> Splitting 34 hlog(s) in
> hdfs://localhost/hbase/.logs/ip-10-245-202-207.ec2.internal,60020,1289379196235
>

Mime
View raw message