hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: region stuck in failed close state
Date Mon, 30 May 2016 13:57:00 GMT
There is a debug log in HRegion#replayWALFlushStartMarker:

              LOG.debug(getRegionInfo().getEncodedName() + " : "
                  + " Prepared flush with seqId:" + flush.getFlushSequenceNumber());

...

                LOG.debug(getRegionInfo().getEncodedName() + " : "
                  + " Prepared empty flush with seqId:" + flush.getFlushSequenceNumber());

I searched for them in the log you attached to HBASE-15900 but didn't find
any occurrence.

FYI

On Mon, May 30, 2016 at 2:59 AM, Heng Chen <heng.chen.1986@gmail.com> wrote:

> I found something useful.
>
> When we do region.close, if there is a compaction or flush in progress,
> close will wait for the compaction or flush to finish.
>
> {code: title=HRegion.java}
> @Override
> public void waitForFlushesAndCompactions() {
>   synchronized (writestate) {
>     if (this.writestate.readOnly) {
>       // we should not wait for replayed flushes if we are read only
>       // (for example in case the region is a secondary replica).
>       return;
>     }
>     boolean interrupted = false;
>     try {
>       while (writestate.compacting > 0 || writestate.flushing) {
>         LOG.debug("waiting for " + writestate.compacting + " compactions"
>           + (writestate.flushing ? " & cache flush" : "")
>           + " to complete for region " + this);
>         try {
>           writestate.wait();
>         } catch (InterruptedException iex) {
>           // essentially ignore and propagate the interrupt back up
>           LOG.warn("Interrupted while waiting");
>           interrupted = true;
>         }
>       }
>     } finally {
>       if (interrupted) {
>         Thread.currentThread().interrupt();
>       }
>     }
>   }
> }
> {code}
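>
> (For context: the loop above only wakes up when whoever set the flag
> clears it and notifies writestate. A minimal sketch of that hand-off,
> assuming the same writestate fields as above; not a verbatim copy of the
> HBase flush path:)
>
> {code: title=flush completion (sketch)}
> synchronized (writestate) {
>   // The thread that set writestate.flushing is expected to clear it and
>   // wake any waiters; otherwise waitForFlushesAndCompactions() in
>   // region.close blocks forever.
>   writestate.flushing = false;
>   writestate.notifyAll();
> }
> {code}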
>
> And writestate.flushing will be set to true in two places:
>
> HRegion.flushCache and HRegion.replayWALFlushStartMarker
>
> {code: title=HRegion.flushCache}
> synchronized (writestate) {
>   if (!writestate.flushing && writestate.writesEnabled) {
>     this.writestate.flushing = true;
>   } else {
>     // ...
>   }
> }
> {code}
>
> {code: title=HRegion.replayWALFlushStartMarker}
> synchronized (writestate) {
>   try {
>     // ...
>     if (!writestate.flushing) {
>       this.writestate.flushing = true;
>       // ...
>     }
>     // ...
> {code}
>
>
> Notice that, in HRegion.replayWALFlushStartMarker, we do not check
> writestate.writesEnabled before setting writestate.flushing to true.
>
> So if region.close wakes up from writestate.wait but the writestate lock is
> acquired first by HRegion.replayWALFlushStartMarker, flushing will be set to
> true again; the while loop then sees flushing == true and goes back to
> waiting, so region.close can be stuck in writestate.wait forever.
>
>
> Could this happen in practice?
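>
> (A minimal sketch of a possible guard, mirroring the writesEnabled check
> from HRegion.flushCache inside replayWALFlushStartMarker; this is only an
> illustration, not the actual HBASE-15900 patch, and the early-return value
> is assumed:)
>
> {code: title=possible guard (sketch)}
> synchronized (writestate) {
>   // Refuse to start a replayed flush once writes are disabled,
>   // i.e. while region.close is tearing the region down.
>   if (!writestate.flushing && writestate.writesEnabled) {
>     this.writestate.flushing = true;
>     // ... prepare the replayed flush as before
>   } else {
>     return null;  // assumed: skip the marker instead of re-setting flushing
>   }
> }
> {code}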
>
>
> 2016-05-27 10:44 GMT+08:00 Heng Chen <heng.chen.1986@gmail.com>:
>
> > Thanks guys, yesterday I restarted the related RS and the failed-close
> > region reopened successfully. But today, another region fell into this
> > state.
> >
> > I pasted the related RS's jstack information. This time the failed-close
> > region is 9368190b3ba46238534b6307702aabae
> >
> > 2016-05-26 21:50 GMT+08:00 Ted Yu <yuzhihong@gmail.com>:
> >
> >> Heng:
> >> Can you pastebin the complete stack trace for the region server ?
> >>
> >> Snippet from region server log may also provide more clue.
> >>
> >> Thanks
> >>
> >> On Wed, May 25, 2016 at 9:48 PM, Heng Chen <heng.chen.1986@gmail.com>
> >> wrote:
> >>
> >> > On the master web UI, I could see region (c371fb20c372b8edbf54735409ab5c4a)
> >> > always in failed-close state, so the balancer could not run.
> >> >
> >> >
> >> > I checked the region on the RS, and found logs about this region:
> >> >
> >> > 2016-05-26 12:42:10,490 INFO  [MemStoreFlusher.1]
> >> > regionserver.MemStoreFlusher: Waited 90447ms on a compaction to clean up
> >> > 'too many store files'; waited long enough... proceeding with flush of
> >> > frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
> >> > 2016-05-26 12:42:20,043 INFO  [dx-pipe-regionserver4-online,16020,1464166626969_ChoreService_1]
> >> > regionserver.HRegionServer:
> >> > dx-pipe-regionserver4-online,16020,1464166626969-MemstoreFlusherChore
> >> > requesting flush for region
> >> > frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
> >> > after a delay of 20753
> >> > 2016-05-26 12:42:30,043 INFO  [dx-pipe-regionserver4-online,16020,1464166626969_ChoreService_1]
> >> > regionserver.HRegionServer:
> >> > dx-pipe-regionserver4-online,16020,1464166626969-MemstoreFlusherChore
> >> > requesting flush for region
> >> > frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
> >> > after a delay of 7057
> >> >
> >> >
> >> > The related jstack information is below:
> >> >
> >> > Thread 12403 (RS_CLOSE_REGION-dx-pipe-regionserver4-online:16020-2):
> >> >   State: WAITING
> >> >   Blocked count: 1
> >> >   Waited count: 2
> >> >   Waiting on org.apache.hadoop.hbase.regionserver.HRegion$WriteState@1390594c
> >> >   Stack:
> >> >     java.lang.Object.wait(Native Method)
> >> >     java.lang.Object.wait(Object.java:502)
> >> >     org.apache.hadoop.hbase.regionserver.HRegion.waitForFlushesAndCompactions(HRegion.java:1512)
> >> >     org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1371)
> >> >     org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1336)
> >> >     org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
> >> >     org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> >> >     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >> >     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >> >     java.lang.Thread.run(Thread.java:745)
> >> >
> >> >
> >> > Our HBase cluster version is 1.1.1. I tried to compact this region,
> >> > but the compaction is stuck at 89.58% progress:
> >> >
> >> > frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
> >> > 85860221 85860221 89.58%
> >> >
> >>
> >
> >
>
