From: Heng Chen
To: user@hbase.apache.org
Date: Mon, 30 May 2016 17:59:35 +0800
Subject: Re: region stuck in failed close state
I found something useful. When we call region.close(), if there is a compaction or flush in progress, close() will wait for the compaction or flush to finish:

{code:title=HRegion.java}
@Override
public void waitForFlushesAndCompactions() {
  synchronized (writestate) {
    if (this.writestate.readOnly) {
      // we should not wait for replayed flushes if we are read only (for example in case the
      // region is a secondary replica).
      return;
    }
    boolean interrupted = false;
    try {
      while (writestate.compacting > 0 || writestate.flushing) {
        LOG.debug("waiting for " + writestate.compacting + " compactions"
            + (writestate.flushing ? " & cache flush" : "") + " to complete for region " + this);
        try {
          writestate.wait();
        } catch (InterruptedException iex) {
          // essentially ignore and propagate the interrupt back up
          LOG.warn("Interrupted while waiting");
          interrupted = true;
        }
      }
    } finally {
      if (interrupted) {
        Thread.currentThread().interrupt();
      }
    }
  }
}
{code}

And writestate.flushing is set to true in two places: HRegion.flushCache and HRegion.replayWALFlushStartMarker.

{code:title=HRegion.flushCache}
synchronized (writestate) {
  if (!writestate.flushing && writestate.writesEnabled) {
    this.writestate.flushing = true;
  } else {
    // ...
  }
}
{code}

{code:title=HRegion.replayWALFlushStartMarker}
synchronized (writestate) {
  try {
    // ...
    if (!writestate.flushing) {
      this.writestate.flushing = true;
      // ...
    }
{code}

Notice that in HRegion.replayWALFlushStartMarker we do not check writestate.writesEnabled before setting writestate.flushing to true. So if region.close() wakes up from writestate.wait() but the lock is acquired first by HRegion.replayWALFlushStartMarker, flushing will be set to true again, and region.close() will be stuck in writestate.wait() forever. Could this happen in practice?
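To make the race concrete, here is a minimal, self-contained sketch (this is NOT HBase's actual code; the class name FlushCloseRace and the method names tryStartFlush / tryStartReplayFlush / beginClose / closeCanProceed are invented for illustration, and only the writestate flag logic mirrors the snippets quoted above). It shows that adding the missing writesEnabled check to the replay path keeps the flushing flag from being re-set after close() has disabled writes:

```java
// Minimal model of the writestate race (hypothetical names, not HBase code).
final class WriteState {
    boolean flushing = false;
    boolean writesEnabled = true; // cleared once the region starts to close
    int compacting = 0;
}

class FlushCloseRace {
    private final WriteState writestate = new WriteState();

    // Mirrors HRegion.flushCache: flush only while writes are still enabled.
    boolean tryStartFlush() {
        synchronized (writestate) {
            if (!writestate.flushing && writestate.writesEnabled) {
                writestate.flushing = true;
                return true;
            }
            return false;
        }
    }

    // Mirrors HRegion.replayWALFlushStartMarker WITH the missing guard added.
    // Without the writesEnabled check, this path could re-set flushing after
    // close() disabled writes, and close() would wait forever.
    boolean tryStartReplayFlush() {
        synchronized (writestate) {
            if (!writestate.flushing && writestate.writesEnabled) { // guard added
                writestate.flushing = true;
                return true;
            }
            return false;
        }
    }

    // Models the close path disabling writes before waiting on writestate.
    void beginClose() {
        synchronized (writestate) {
            writestate.writesEnabled = false;
        }
    }

    // Models the condition waitForFlushesAndCompactions() waits for.
    boolean closeCanProceed() {
        synchronized (writestate) {
            return !writestate.flushing && writestate.compacting == 0;
        }
    }

    public static void main(String[] args) {
        FlushCloseRace region = new FlushCloseRace();
        region.beginClose(); // close() disables writes first
        if (region.tryStartReplayFlush()) {
            throw new AssertionError("replay flush started after close began");
        }
        if (!region.closeCanProceed()) {
            throw new AssertionError("close would be stuck waiting");
        }
        System.out.println("ok");
    }
}
```

With the guard in place, once close() has cleared writesEnabled neither flush path can set flushing again, so the wait loop in waitForFlushesAndCompactions() can terminate.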
2016-05-27 10:44 GMT+08:00 Heng Chen :

> Thanks guys, yesterday I restarted the related RS and the failed-close
> region reopened successfully. But today another region has fallen into
> this state.
>
> I pasted the related RS's jstack information. This time the failed-close
> region is 9368190b3ba46238534b6307702aabae
>
> 2016-05-26 21:50 GMT+08:00 Ted Yu :
>
>> Heng:
>> Can you pastebin the complete stack trace for the region server?
>>
>> A snippet from the region server log may also provide more clues.
>>
>> Thanks
>>
>> On Wed, May 25, 2016 at 9:48 PM, Heng Chen wrote:
>>
>> > On the master web UI, I can see that region
>> > c371fb20c372b8edbf54735409ab5c4a is always in the failed close state,
>> > so the balancer cannot run.
>> >
>> > I checked the region on the RS and found these logs about it:
>> >
>> > 2016-05-26 12:42:10,490 INFO [MemStoreFlusher.1]
>> > regionserver.MemStoreFlusher: Waited 90447ms on a compaction to clean up
>> > 'too many store files'; waited long enough... proceeding with flush of
>> > frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
>> > 2016-05-26 12:42:20,043 INFO
>> > [dx-pipe-regionserver4-online,16020,1464166626969_ChoreService_1]
>> > regionserver.HRegionServer:
>> > dx-pipe-regionserver4-online,16020,1464166626969-MemstoreFlusherChore
>> > requesting flush for region
>> > frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
>> > after a delay of 20753
>> > 2016-05-26 12:42:30,043 INFO
>> > [dx-pipe-regionserver4-online,16020,1464166626969_ChoreService_1]
>> > regionserver.HRegionServer:
>> > dx-pipe-regionserver4-online,16020,1464166626969-MemstoreFlusherChore
>> > requesting flush for region
>> > frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
>> > after a delay of 7057
>> >
>> > The related jstack information is below:
>> >
>> > Thread 12403 (RS_CLOSE_REGION-dx-pipe-regionserver4-online:16020-2):
>> >   State: WAITING
>> >   Blocked count: 1
>> >   Waited count: 2
>> >   Waiting on org.apache.hadoop.hbase.regionserver.HRegion$WriteState@1390594c
>> >   Stack:
>> >     java.lang.Object.wait(Native Method)
>> >     java.lang.Object.wait(Object.java:502)
>> >     org.apache.hadoop.hbase.regionserver.HRegion.waitForFlushesAndCompactions(HRegion.java:1512)
>> >     org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1371)
>> >     org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1336)
>> >     org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
>> >     org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>> >     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> >     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> >     java.lang.Thread.run(Thread.java:745)
>> >
>> > Our HBase cluster version is 1.1.1. I tried to compact this region;
>> > the compaction is stuck at 89.58%:
>> >
>> > frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
>> > 85860221  85860221  89.58%