From user-return-14401-apmail-hbase-user-archive=hbase.apache.org@hbase.apache.org Thu Dec 09 17:44:48 2010 Return-Path: Delivered-To: apmail-hbase-user-archive@www.apache.org Received: (qmail 94946 invoked from network); 9 Dec 2010 17:44:47 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 9 Dec 2010 17:44:47 -0000 Received: (qmail 31105 invoked by uid 500); 9 Dec 2010 17:44:46 -0000 Delivered-To: apmail-hbase-user-archive@hbase.apache.org Received: (qmail 31080 invoked by uid 500); 9 Dec 2010 17:44:46 -0000 Mailing-List: contact user-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hbase.apache.org Delivered-To: mailing list user@hbase.apache.org Received: (qmail 31072 invoked by uid 99); 9 Dec 2010 17:44:46 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 09 Dec 2010 17:44:46 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=10.0 tests=FREEMAIL_FROM,NORMAL_HTTP_TO_IP,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of jdcryans@gmail.com designates 209.85.161.50 as permitted sender) Received: from [209.85.161.50] (HELO mail-fx0-f50.google.com) (209.85.161.50) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 09 Dec 2010 17:44:41 +0000 Received: by fxm14 with SMTP id 14so2459482fxm.23 for ; Thu, 09 Dec 2010 09:44:20 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:sender:received :in-reply-to:references:date:x-google-sender-auth:message-id:subject :from:to:content-type:content-transfer-encoding; bh=BQNE9x3uO1Gj1UNJRYLdOHkp0sa1gMlioQ2W/COg/x8=; b=aFEaniiXTdC1k2xK9TVo/vBTFC8I+17M0/9syzeSZVjgvDzt+SID4q0DdYKb6Y0rqG GLyzwDOi+yna6UBjDO19/TL50gsDzdUS1inrdMPLaV6XCu4aGySo5UrKpnjoRnmbQfZR pVp6BSkfZhrr8Fsm9mZMfa3sLm+tLEFq/kLdU= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:content-type :content-transfer-encoding; b=NNxwVDx3BMU1wsSjwSmbtef8o3BSp56tsHPUMS1oBzbGfStD0G2HXkIcNPYFETENX6 yHjaqZWS3PoCHmRmdt4NLjAZFiUjXtvae1+6+/u0YxJ3XIj3EyV43ooaeHc12rLzjttl I+fknnQTvtJ90k4wXtjn4laYhrNNl7pNvXfR0= MIME-Version: 1.0 Received: by 10.223.83.206 with SMTP id g14mr62460fal.129.1291916660046; Thu, 09 Dec 2010 09:44:20 -0800 (PST) Sender: jdcryans@gmail.com Received: by 10.223.86.133 with HTTP; Thu, 9 Dec 2010 09:44:19 -0800 (PST) In-Reply-To: References: Date: Thu, 9 Dec 2010 09:44:19 -0800 X-Google-Sender-Auth: pkMHQqJzyG--VwfGhfqjb3QyNzo Message-ID: Subject: Re: RegionServer crashed with ZK session expired From: Jean-Daniel Cryans To: user@hbase.apache.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Hey Alex, It's either the client or the server box that wasn't responding, either way they didn't talk for 52 seconds when your session timeout was set at 40 seconds. I suggest you also take a look at the ZK log, it should tell you exactly when the session was expired. J-D On Thu, Dec 9, 2010 at 9:19 AM, Alex Baranau wro= te: > Hello guys, > > I got regionserver crash and trying to find out why. I found > * in regionserver log [1] : ZK session expired, and before that slow hlog > edits; > * nothing in DataNode log [2] and HMaster[4] log; > * some warns in ZK log [3] with EndOfStreamException. > > I wonder can it be because of long GC and thus "stop the world" pause for > about a minute (seen in logs)? Unfortunately GC wasn't logged (this is an > old test box). If it can be caused by GC I'd restart the server with GC > logging, but if not, I'd like to figure out the cause... > > HBase version is 0.89.20100924. > > Thank you, > > Alex Baranau > ---- > Sematext :: http://sematext.com/ > > [1] Regionserver log: ZK session expired, before that slow hlog edits > > 2010-12-08 06:43:15,403 INFO org.apache.hadoop.hbase.regionserver.HRegion= : > Starting compaction on region agg2,system|tq|s\x00|in licensed License > project\x00|,1291790014280.75a3871bc968e0438e763cde7e313c2c. > 2010-12-08 06:43:15,404 INFO org.apache.hadoop.hbase.regionserver.Store: > Started compaction of 4 file(s) in a of agg2,system|tq|s\x00|in licensed > License project\x00|,1291790014280.75a3871bc968e0438e763cde7e313c2c., > hasReferences=3Dtrue, into > hdfs://localhost/hbase/agg2/75a3871bc968e0438e763cde7e313c2c/.tmp, > sequenceid=3D1153119174 > 2010-12-08 06:43:24,825 DEBUG > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction > started; Attempting to free 39.93 MB of total=3D339.31 MB > 2010-12-08 06:43:24,829 DEBUG > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction > completed; freed=3D39.95 MB, total=3D299.36 MB, single=3D147.93 MB, multi= =3D154.82 > MB, memory=3D33.28 MB > 2010-12-08 06:43:50,871 DEBUG > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction > started; Attempting to free 40 MB of total=3D339.39 MB > 2010-12-08 06:43:50,875 DEBUG > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction > completed; freed=3D40.01 MB, total=3D299.37 MB, single=3D147.86 MB, multi= =3D154.96 > MB, memory=3D33.28 MB > 2010-12-08 06:43:57,095 WARN org.apache.hadoop.hbase.regionserver.wal.HLo= g: > IPC Server handler 4 on 60020 took 5338ms appending an edit to hlog; > editcount=3D8882 > 2010-12-08 06:44:25,086 WARN org.apache.hadoop.hbase.regionserver.wal.HLo= g: > IPC Server handler 24 on 60020 took 5658ms appending an edit to hlog; > editcount=3D13054 > 2010-12-08 06:44:55,815 WARN org.apache.hadoop.hbase.regionserver.wal.HLo= g: > IPC Server handler 5 on 60020 took 5884ms appending an edit to hlog; > editcount=3D13472 > 2010-12-08 06:45:33,003 DEBUG > org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=3D337.5 = MB, > free=3D61.68 MB, max=3D399.18 MB, blocks=3D9015, accesses=3D553728227, > hits=3D40478965, hitRatio=3D7.31%%, evictions=3D135536, evicted=3D8630645= 3, > evictedPerRun=3D636.77880859375 > 2010-12-08 06:45:45,333 INFO > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Forced flushing of > raw3,,1291780273350.73d076a68191712c68d16c4805973561. because global > memstore limit of 798.4m exceeded; currently 812.8m and flushing till 698= .6m > 2010-12-08 06:45:45,363 DEBUG org.apache.hadoop.hbase.regionserver.HRegio= n: > Started memstore flush for region > raw3,,1291780273350.73d076a68191712c68d16c4805973561.. Current region > memstore size 45.0m > 2010-12-08 06:45:45,364 DEBUG org.apache.hadoop.hbase.regionserver.HRegio= n: > Finished snapshotting, commencing flushing stores > 2010-12-08 06:45:58,107 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming flushed file at > hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/.tmp/6597521= 672231455831 > to > hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/c/3157254097= 847060580 > 2010-12-08 06:45:58,164 INFO org.apache.hadoop.hbase.regionserver.Store: > Added > hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/c/3157254097= 847060580, > entries=3D8356, sequenceid=3D1153215388, memsize=3D1.7m, filesize=3D73.2k= to > raw3,,1291780273350.73d076a68191712c68d16c4805973561. > 2010-12-08 06:46:44,092 INFO org.apache.zookeeper.ClientCnxn: Client sess= ion > timed out, have not heard from server in 52481ms for sessionid > 0x12c34f1783c0002, closing socket connection and attempting reconnect > 2010-12-08 06:46:44,092 INFO org.apache.zookeeper.ClientCnxn: Client sess= ion > timed out, have not heard from server in 52481ms for sessionid > 0x12c34f1783c00bf, closing socket connection and attempting reconnect > 2010-12-08 06:46:44,200 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, > state: Disconnected, type: None, path: null > 2010-12-08 06:46:44,200 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, > state: Disconnected, type: None, path: null > 2010-12-08 06:46:44,201 WARN > org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: No lon= ger > connected to ZooKeeper, current state: Disconnected > 2010-12-08 06:46:51,218 INFO org.apache.zookeeper.ClientCnxn: Opening soc= ket > connection to server ip-10-245-202-207.ec2.internal/10.245.202.207:2181 > 2010-12-08 06:46:51,219 INFO org.apache.zookeeper.ClientCnxn: Socket > connection established to ip-10-245-202-207.ec2.internal/10.245.202.207:2= 181, > initiating session > 2010-12-08 06:46:51,220 INFO org.apache.zookeeper.ClientCnxn: Session > establishment complete on server ip-10-245-202-207.ec2.internal/ > 10.245.202.207:2181, sessionid =3D 0x12c34f1783c00bf, negotiated timeout = =3D > 40000 > 2010-12-08 06:46:51,264 INFO org.apache.zookeeper.ZooKeeper: Session: > 0x12c34f1783c00bf closed > 2010-12-08 06:46:51,264 DEBUG > org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: > Close= d > connection with ZooKeeper; /hbase/root-region-server > 2010-12-08 06:46:58,375 INFO org.apache.zookeeper.ClientCnxn: Opening soc= ket > connection to server ip-10-245-202-207.ec2.internal/10.245.202.207:2181 > 2010-12-08 06:46:58,376 INFO org.apache.zookeeper.ClientCnxn: Socket > connection established to ip-10-245-202-207.ec2.internal/10.245.202.207:2= 181, > initiating session > 2010-12-08 06:46:58,490 INFO org.apache.zookeeper.ClientCnxn: Unable to > reconnect to ZooKeeper service, session 0x12c34f1783c0002 has expired, > closing socket connection > 2010-12-08 06:46:58,491 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, > state: Expired, type: None, path: null > 2010-12-08 06:46:58,491 ERROR > org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session > expired > 2010-12-08 06:46:58,491 FATAL > org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region serve= r > serverName=3Dip-10-245-202-207.ec2.internal,60020,1289379196235, > load=3D(requests=3D73, regions=3D649, usedHeap=3D1983, maxHeap=3D1995): Z= ooKeeper > session expired > > > [2] DataNode log: nothing, just noticeable pause for almost 1 minute > > 2010-12-08 06:45:58,120 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: / > 127.0.0.1:50010, dest: /127.0.0.1:39194, bytes: 200, op: HDFS_READ, cliID= : > DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_128937= 9196335, > offset: 74752, srvID: DS-137381847-10.245.202.207-50010-1287129392787, > blockid: blk_6449721912917715607_913675, duration: 3639142 > 2010-12-08 06:45:58,134 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: / > 127.0.0.1:50010, dest: /127.0.0.1:39195, bytes: 716, op: HDFS_READ, cliID= : > DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_128937= 9196335, > offset: 74240, srvID: DS-137381847-10.245.202.207-50010-1287129392787, > blockid: blk_6449721912917715607_913675, duration: 427441 > 2010-12-08 06:45:58,148 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: / > 127.0.0.1:50010, dest: /127.0.0.1:39193, bytes: 1014972, op: HDFS_READ, > cliID: > DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_128937= 9196335, > offset: 66101760, srvID: DS-137381847-10.245.202.207-50010-1287129392787, > blockid: blk_825672696763040280_900371, duration: 6352102264 > 2010-12-08 06:46:44,192 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_-8677121983399284148_913676 src: /127.0.0.1:39196 dest: /127.0.0.1:50= 010 > 2010-12-08 06:46:51,263 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: / > 127.0.0.1:50010, dest: /127.0.0.1:53722, bytes: 264192, op: HDFS_READ, > cliID: > DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_128937= 9196335, > offset: 17161216, srvID: DS-137381847-10.245.202.207-50010-1287129392787, > blockid: blk_8449446729722678098_531164, duration: 401874426400 > 2010-12-08 06:46:51,333 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: / > 127.0.0.1:50010, dest: /127.0.0.1:53737, bytes: 264192, op: HDFS_READ, > cliID: > DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_128937= 9196335, > offset: 38253056, srvID: DS-137381847-10.245.202.207-50010-1287129392787, > blockid: blk_-4815658898725325280_624191, duration: 401614939902 > 2010-12-08 06:46:58,266 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls > recoverBlock(block=3Dblk_7609568532460932683_912614, targets=3D[127.0.0.1= :50010 > ]) > > [3] Zookeeper log: some warns > > 2010-12-08 02:50:46,100 - INFO =A0[ProcessThread:-1:PrepRequestProcessor@= 385] > - Processed session termination for sessionid: 0x12c34f1783d6eba > 2010-12-08 02:50:46,104 - INFO =A0[NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for > client /127.0.0.1:48900 which had sessionid 0x12c34f1783d6eba > 2010-12-08 06:46:44,092 - WARN =A0[NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@633] - EndOfStreamException: Unable to > read additional data from client sessionid 0x12c34f1783c00bf, likely clie= nt > has closed socket > 2010-12-08 06:46:44,661 - INFO =A0[NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for > client /10.245.202.207:53464 which had sessionid 0x12c34f1783c00bf > 2010-12-08 06:46:44,662 - WARN =A0[NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@633] - EndOfStreamException: Unable to > read additional data from client sessionid 0x12c34f1783c0002, likely clie= nt > has closed socket > 2010-12-08 06:46:44,662 - INFO =A0[NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for > client /10.245.202.207:37888 which had sessionid 0x12c34f1783c0002 > 2010-12-08 06:46:51,219 - INFO =A0[NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@250] - Accepted socket connect= ion > from /10.245.202.207:33388 > 2010-12-08 06:46:51,219 - INFO =A0[NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@769] - Client attempting to renew sess= ion > 0x12c34f1783c00bf at /10.245.202.207:33388 > 2010-12-08 06:46:51,220 - INFO =A0[NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1579] - Established session > 0x12c34f1783c00bf with negotiated timeout 40000 for client / > 10.245.202.207:33388 > 2010-12-08 06:46:51,220 - INFO =A0[ProcessThread:-1:PrepRequestProcessor@= 385] > - Processed session termination for sessionid: 0x12c34f1783c00bf > 2010-12-08 06:46:51,223 - INFO =A0[NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for > client /10.245.202.207:33388 which had sessionid 0x12c34f1783c00bf > 2010-12-08 06:46:58,001 - INFO =A0[SessionTracker:ZooKeeperServer@315] - > Expiring session 0x12c34f1783c0002, timeout of 40000ms exceeded > 2010-12-08 06:46:58,002 - INFO =A0[ProcessThread:-1:PrepRequestProcessor@= 385] > - Processed session termination for sessionid: 0x12c34f1783c0002 > 2010-12-08 06:46:58,376 - INFO =A0[NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@250] - Accepted socket connect= ion > from /10.245.202.207:33394 > 2010-12-08 06:46:58,489 - INFO =A0[NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@769] - Client attempting to renew sess= ion > 0x12c34f1783c0002 at /10.245.202.207:33394 > 2010-12-08 06:46:58,489 - INFO =A0[NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1572] - Invalid session 0x12c34f1783c0= 002 > for client /10.245.202.207:33394, probably expired > 2010-12-08 06:46:58,490 - INFO =A0[NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for > client /10.245.202.207:33394 which had sessionid 0x12c34f1783c0002 > > [4] HMaster log: nothing > > 2010-12-08 06:45:02,246 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.metaScanner scan of 648 row(s) of meta region {server: > 10.245.202.207:60020, regionname: .META.,,1.1028785192, startKey: <>} > complete > 2010-12-08 06:45:02,246 INFO org.apache.hadoop.hbase.master.BaseScanner: = All > 1 .META. region(s) scanned > 2010-12-08 06:45:25,277 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.rootScanner scanning meta region {server: 10.245.202.207:60= 020, > regionname: -ROOT-,,0.70236052, startKey: <>} > 2010-12-08 06:45:27,132 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.rootScanner scan of 1 row(s) of meta region {server: > 10.245.202.207:60020, regionname: -ROOT-,,0.70236052, startKey: <>} compl= ete > 2010-12-08 06:45:35,367 INFO org.apache.hadoop.hbase.master.ServerManager= : 1 > region servers, 0 dead, average load 649.0 > 2010-12-08 06:45:45,563 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.metaScanner scanning meta region {server: 10.245.202.207:60= 020, > regionname: .META.,,1.1028785192, startKey: <>} > 2010-12-08 06:46:25,244 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.rootScanner scanning meta region {server: 10.245.202.207:60= 020, > regionname: -ROOT-,,0.70236052, startKey: <>} > 2010-12-08 06:46:35,369 INFO org.apache.hadoop.hbase.master.ServerManager= : 1 > region servers, 0 dead, average load 649.0 > 2010-12-08 06:46:44,154 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.rootScanner scan of 1 row(s) of meta region {server: > 10.245.202.207:60020, regionname: -ROOT-,,0.70236052, startKey: <>} compl= ete > 2010-12-08 06:46:58,005 INFO org.apache.hadoop.hbase.master.ServerManager= : > ip-10-245-202-207.ec2.internal,60020,1289379196235 znode expired > 2010-12-08 06:46:58,005 DEBUG org.apache.hadoop.hbase.master.ServerManage= r: > Added=3Dip-10-245-202-207.ec2.internal,60020,1289379196235 to dead server= s, > added shutdown processing operation > 2010-12-08 06:46:58,006 INFO org.apache.hadoop.hbase.master.RegionManager= : > -ROOT- region unset (but not set to be reassigned) > 2010-12-08 06:46:58,006 INFO org.apache.hadoop.hbase.master.RegionManager= : > META region whose startkey is =A0removed from onlineMetaRegions > 2010-12-08 06:46:58,006 DEBUG > org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing tod= o: > ProcessServerShutdown of ip-10-245-202-207.ec2.internal,60020,12893791962= 35 > 2010-12-08 06:46:58,006 INFO > org.apache.hadoop.hbase.master.RegionServerOperation: Process shutdown of > server ip-10-245-202-207.ec2.internal,60020,1289379196235: logSplit: fals= e, > rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 0 > 2010-12-08 06:46:58,012 INFO org.apache.hadoop.hbase.regionserver.wal.HLo= g: > Splitting 34 hlog(s) in > hdfs://localhost/hbase/.logs/ip-10-245-202-207.ec2.internal,60020,1289379= 196235 >