@Ted, the log I pasted is at INFO level. I changed it to DEBUG level
yesterday; if this happens again, I will upload the DEBUG-level log.
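
For reference, I enabled DEBUG roughly like this (a sketch assuming the stock
conf/log4j.properties; the exact logger name may differ in your deployment):

{code:title=conf/log4j.properties (sketch)}
# Turn on DEBUG for the region server internals so that the
# replayWALFlushStartMarker "Prepared flush with seqId" messages show up.
log4j.logger.org.apache.hadoop.hbase.regionserver=DEBUG
{code}
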
2016-05-30 21:57 GMT+08:00 Ted Yu <yuzhihong@gmail.com>:
> There are debug logs in HRegion#replayWALFlushStartMarker:
>
>   LOG.debug(getRegionInfo().getEncodedName() + " : "
>       + " Prepared flush with seqId:" + flush.getFlushSequenceNumber());
>
>   ...
>
>   LOG.debug(getRegionInfo().getEncodedName() + " : "
>       + " Prepared empty flush with seqId:" + flush.getFlushSequenceNumber());
>
> I searched for them in the log you attached to HBASE-15900 but didn't find
> any occurrence.
>
> FYI
>
> On Mon, May 30, 2016 at 2:59 AM, Heng Chen <heng.chen.1986@gmail.com>
> wrote:
>
> > I found something useful.
> >
> > When we do region.close, if there is a compaction or flush in progress,
> > close will wait for the compaction or flush to finish.
> >
> > {code:title=HRegion.java}
> > @Override
> > public void waitForFlushesAndCompactions() {
> >   synchronized (writestate) {
> >     if (this.writestate.readOnly) {
> >       // we should not wait for replayed flushed if we are read only (for example
> >       // in case the region is a secondary replica).
> >       return;
> >     }
> >     boolean interrupted = false;
> >     try {
> >       while (writestate.compacting > 0 || writestate.flushing) {
> >         LOG.debug("waiting for " + writestate.compacting + " compactions"
> >             + (writestate.flushing ? " & cache flush" : "")
> >             + " to complete for region " + this);
> >         try {
> >           writestate.wait();
> >         } catch (InterruptedException iex) {
> >           // essentially ignore and propagate the interrupt back up
> >           LOG.warn("Interrupted while waiting");
> >           interrupted = true;
> >         }
> >       }
> >     } finally {
> >       if (interrupted) {
> >         Thread.currentThread().interrupt();
> >       }
> >     }
> >   }
> > }
> > {code}
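> >
> > (For context on how close normally gets past this wait: when a flush
> > finishes, the flush path clears the flag and notifies waiters. Paraphrased
> > from memory of the 1.1 code, roughly:)
> >
> > {code:title=HRegion.flushcache, finally block (paraphrased)}
> > } finally {
> >   synchronized (writestate) {
> >     writestate.flushing = false;
> >     this.writestate.flushRequested = false;
> >     writestate.notifyAll();
> >   }
> > }
> > {code}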
> >
> > And writestate.flushing will be set to true in two places:
> >
> > HRegion.flushCache and HRegion.replayWALFlushStartMarker
> >
> > {code:title=HRegion.flushCache}
> > synchronized (writestate) {
> >   if (!writestate.flushing && writestate.writesEnabled) {
> >     this.writestate.flushing = true;
> >   } else {
> >     ...
> >   }
> > }
> > {code}
> >
> > {code:title=HRegion.replayWALFlushStartMarker}
> > synchronized (writestate) {
> >   try {
> >     ...
> >     if (!writestate.flushing) {
> >       this.writestate.flushing = true;
> >       ...
> >     }
> > {code}
> >
> >
> > Notice that in HRegion.replayWALFlushStartMarker we do not check
> > writestate.writesEnabled before setting writestate.flushing to true.
> >
> > So if region.close wakes up from writestate.wait but the writestate lock is
> > acquired first by HRegion.replayWALFlushStartMarker, flushing will be set to
> > true again, and region.close will be stuck in writestate.wait forever.
> >
> > Could this happen in a real scenario?
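> >
> > To make the guard concrete, here is a standalone sketch of the check I
> > mean (not HRegion code; the class and method names are invented for
> > illustration): only claim the flushing flag while writes are still
> > enabled, so a closing region cannot see the flag flip back to true after
> > it has been woken up.
> >
> > {code:title=Standalone sketch of the proposed check (not HBase code)}
> > final class WriteStateSketch {
> >   private boolean flushing;
> >   private boolean writesEnabled = true;
> >
> >   // Simplified waitForFlushesAndCompactions(): region close blocks here
> >   // until no flush is in progress.
> >   synchronized void waitForFlushes() throws InterruptedException {
> >     while (flushing) {
> >       wait();
> >     }
> >   }
> >
> >   // Region close disables writes before it starts waiting.
> >   synchronized void disableWrites() {
> >     writesEnabled = false;
> >   }
> >
> >   // What replayWALFlushStartMarker should do: mirror flushCache and only
> >   // claim the flushing flag while writes are still enabled.
> >   synchronized boolean tryStartReplayFlush() {
> >     if (!flushing && writesEnabled) {
> >       flushing = true;
> >       return true;
> >     }
> >     return false; // region is closing, or a flush is already running
> >   }
> >
> >   // Flush completion clears the flag and wakes the closing thread.
> >   synchronized void finishFlush() {
> >     flushing = false;
> >     notifyAll();
> >   }
> > }
> > {code}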
> >
> >
> > 2016-05-27 10:44 GMT+08:00 Heng Chen <heng.chen.1986@gmail.com>:
> >
> > > Thanks guys. Yesterday I restarted the related RS and the failed-close
> > > region reopened successfully. But today another region has fallen into
> > > this state.
> > >
> > > I pasted the related RS's jstack information. This time the failed-close
> > > region is 9368190b3ba46238534b6307702aabae
> > >
> > > 2016-05-26 21:50 GMT+08:00 Ted Yu <yuzhihong@gmail.com>:
> > >
> > >> Heng:
> > >> Can you pastebin the complete stack trace for the region server?
> > >>
> > >> A snippet from the region server log may also provide more clues.
> > >>
> > >> Thanks
> > >>
> > >> On Wed, May 25, 2016 at 9:48 PM, Heng Chen <heng.chen.1986@gmail.com>
> > >> wrote:
> > >>
> > >> > On the master web UI, I could see region (c371fb20c372b8edbf54735409ab5c4a)
> > >> > always in the failed-close state, so the balancer could not run.
> > >> >
> > >> >
> > >> > I checked the region on the RS and found these logs about it:
> > >> >
> > >> > 2016-05-26 12:42:10,490 INFO [MemStoreFlusher.1]
> > >> > regionserver.MemStoreFlusher: Waited 90447ms on a compaction to clean up
> > >> > 'too many store files'; waited long enough... proceeding with flush of
> > >> > frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
> > >> >
> > >> > 2016-05-26 12:42:20,043 INFO
> > >> > [dx-pipe-regionserver4-online,16020,1464166626969_ChoreService_1]
> > >> > regionserver.HRegionServer:
> > >> > dx-pipe-regionserver4-online,16020,1464166626969-MemstoreFlusherChore
> > >> > requesting flush for region
> > >> > frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
> > >> > after a delay of 20753
> > >> >
> > >> > 2016-05-26 12:42:30,043 INFO
> > >> > [dx-pipe-regionserver4-online,16020,1464166626969_ChoreService_1]
> > >> > regionserver.HRegionServer:
> > >> > dx-pipe-regionserver4-online,16020,1464166626969-MemstoreFlusherChore
> > >> > requesting flush for region
> > >> > frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
> > >> > after a delay of 7057
> > >> >
> > >> >
> > >> > The related jstack information is below:
> > >> >
> > >> > Thread 12403 (RS_CLOSE_REGION-dx-pipe-regionserver4-online:16020-2):
> > >> >   State: WAITING
> > >> >   Blocked count: 1
> > >> >   Waited count: 2
> > >> >   Waiting on org.apache.hadoop.hbase.regionserver.HRegion$WriteState@1390594c
> > >> >   Stack:
> > >> >     java.lang.Object.wait(Native Method)
> > >> >     java.lang.Object.wait(Object.java:502)
> > >> >     org.apache.hadoop.hbase.regionserver.HRegion.waitForFlushesAndCompactions(HRegion.java:1512)
> > >> >     org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1371)
> > >> >     org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1336)
> > >> >     org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
> > >> >     org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> > >> >     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > >> >     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > >> >     java.lang.Thread.run(Thread.java:745)
> > >> >
> > >> >
> > >> > Our HBase cluster version is 1.1.1. I tried to compact this region, but
> > >> > the compaction is stuck at 89.58%:
> > >> >
> > >> > frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
> > >> > 85860221 85860221 89.58%
> > >> >
> > >>
> > >
> > >
> >
>