hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Flakey table disable/enable [WAS -> Re: Table disabled but all regions still online?]
Date Wed, 18 Nov 2009 18:41:16 GMT
On Wed, Nov 18, 2009 at 8:10 AM, Jochen Frey <jochen@scoutlabs.com> wrote:
..

>
> However, at the same time all the regions are still online, which I can
> verify by way of the web interface as well as the command line interface (>
> 400 regions).
>
> This has happened at least twice by now. The first time I was able to "fix"
> it by restarting HDFS, the second time restarting didn't fix it.
>
>
In 0.20.x hbase, enable/disable of tables is unreliable as written.  It
works when tables are small, or in a unit test context where the
configuration makes messaging more lively, but it quickly turns flakey
once a table has more than a few regions.

Currently, the way it works is that the client messages the master to
run processing over all the regions that make up the table.  The client
then waits under a timeout, continually checking whether all regions
have gone offline, but if the table is large the client will often time
out before the master finishes.  The master runs the process in a worker
thread, in series, updating the .META. table and flagging RegionServers
one at a time that they need to close a region on disable.  Closing a
region entails flushing its memstore, so it can take a while.  Running
in the master context is more or less necessary because regions may be
in a state of transition, and the master is where that state is kept; it
is the one place that knows how to intercept region transitions when a
table is being asked to online/offline.
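
Schematically, the client side of a 0.20 disable amounts to something
like the sketch below.  This is an illustration only, not the actual
HBaseAdmin internals, and the 90-second budget is a made-up stand-in
for the configured retry/timeout settings:

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class DisableWait {
  public static void main(String[] args) throws Exception {
    String table = args[0];
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
    // Message the master: it walks the table's regions in a worker
    // thread, in series, updating .META. and flagging each hosting
    // regionserver to close its region.
    admin.disableTable(table);
    // Client-side wait: poll under a deadline.  With a few hundred
    // regions, the serial closes (each one flushing its memstore) can
    // easily outlast the budget, hence the flakiness.
    long deadline = System.currentTimeMillis() + 90 * 1000; // made up
    while (admin.isTableEnabled(table)) {
      if (System.currentTimeMillis() > deadline) {
        throw new IOException("Timed out; " + table + " still online");
      }
      Thread.sleep(1000);
    }
    System.out.println("Disabled " + table);
  }
}

Re-running the whole thing when the wait times out is the keep-trying
workaround I mention below.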

The master is being rewritten for hbase 0.21.  This is one area that is
being completely redone.  See
http://wiki.apache.org/hadoop/Hbase/MasterRewrite for the high-level design
sketch and then
https://issues.apache.org/jira/browse/HBASE-1730"Near-instantaneous
online schema and table state updates" for explicit
discussion of how we're to do table state transistions.

Enable/disable has been flakey for a while (see
https://issues.apache.org/jira/browse/HBASE-1636).  My understanding is
that it will work eventually if you keep trying (maybe this is wrong?),
so I've always ranked it low on the list of priorities, as something
we've scheduled to fix properly in 0.21.  But you are the second fellow
who has raised enable/disable as a problem during an evaluation, and I'm
a little concerned that flakey enable/disable is earning us a black
mark.  If it's important, I hope folks will flag it so.  In the 0.20.x
context, we could hack up a script to run the table enable/disable in
parallel: it would scan .META., sort the regions by hosting server,
write close messages to each regionserver, and rewrite the table's rows
in .META.  It could then just wait till all regions report disabled,
resignalling if necessary; a rough sketch of what I mean follows below.
If you just want to kill the table outright, such a script may already
exist for you: see https://issues.apache.org/jira/browse/HBASE-1872.
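
Roughly, I have in mind something like the following unfinished sketch,
written against the 0.20 client API.  The closeRegion() stub is
hypothetical (the actual close message to the regionserver is the part
that would need hacking up, or cribbing from the HBASE-1872 tool), and
error handling plus the final wait-and-resignal are left out:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Writables;

public class ParallelDisable {
  private static final byte[] INFO = Bytes.toBytes("info");
  private static final byte[] REGIONINFO = Bytes.toBytes("regioninfo");
  private static final byte[] SERVER = Bytes.toBytes("server");

  public static void main(String[] args) throws Exception {
    final String tableName = args[0];
    HTable meta = new HTable(new HBaseConfiguration(), ".META.");

    // 1. Scan .META. gathering this table's regions, grouped by the
    // server hosting each.
    Map<String, List<HRegionInfo>> byServer =
      new TreeMap<String, List<HRegionInfo>>();
    Scan scan = new Scan();
    scan.addFamily(INFO);
    ResultScanner scanner = meta.getScanner(scan);
    for (Result r : scanner) {
      byte[] bytes = r.getValue(INFO, REGIONINFO);
      if (bytes == null) continue;
      HRegionInfo hri = Writables.getHRegionInfo(bytes);
      if (!Bytes.toString(hri.getTableDesc().getName()).equals(tableName)) {
        continue;
      }
      byte[] server = r.getValue(INFO, SERVER);
      String address = server == null ? "" : Bytes.toString(server);
      if (!byServer.containsKey(address)) {
        byServer.put(address, new ArrayList<HRegionInfo>());
      }
      byServer.get(address).add(hri);
      // 2. Rewrite the table's row in .META. marking the region offline.
      hri.setOffline(true);
      Put p = new Put(hri.getRegionName());
      p.add(INFO, REGIONINFO, Writables.getBytes(hri));
      meta.put(p);
    }
    scanner.close();

    // 3. One thread per regionserver, so region closes run in parallel
    // rather than in series through the master.
    List<Thread> threads = new ArrayList<Thread>();
    for (final Map.Entry<String, List<HRegionInfo>> e : byServer.entrySet()) {
      Thread t = new Thread() {
        public void run() {
          for (HRegionInfo hri : e.getValue()) {
            closeRegion(e.getKey(), hri);
          }
        }
      };
      t.start();
      threads.add(t);
    }
    for (Thread t : threads) t.join();
    // 4. Would then poll until all regions report closed, resignalling
    // any stragglers (omitted from this sketch).
  }

  // Placeholder only: how we message the regionserver to close is the
  // part to hack up (or borrow from the HBASE-1872 kill-table script).
  private static void closeRegion(String server, HRegionInfo hri) {
    System.out.println("Would close " + hri.getRegionNameAsString() +
      " on " + server);
  }
}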

Thanks,
St.Ack



> The first time this happened, we had a lot going on (rolling restart of
> the hbase nodes, hdfs balancer running). The second time I found the following
> exception in the master log (below). Can anyone shed some light on this or
> tell me what additional information would be helpful for debugging?
>
> Thanks so much!
> Jochen
>
>
> 2009-11-17 20:59:12,751 INFO org.apache.hadoop.hbase.master.ServerManager: 8 region servers, 0 dead, average load 50.25
> 2009-11-17 20:59:13,611 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 10.10.0.177:60020, regionname: -ROOT-,,0, startKey: <>}
> 2009-11-17 20:59:13,611 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 10.10.0.189:60020, regionname: .META.,,1, startKey: <>}
> 2009-11-17 20:59:13,620 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 10.10.0.177:60020, regionname: -ROOT-,,0, startKey: <>} complete
> 2009-11-17 20:59:13,622 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan one META region: {server: 10.10.0.189:60020, regionname: .META.,,1, startKey: <>}
> java.net.ConnectException: Connection refused
>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
>        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831)
>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>        at $Proxy6.openScanner(Unknown Source)
>        at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
>        at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>        at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>        at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
>        at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
> 2009-11-17 20:59:13,623 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>
>
> --
> Jochen Frey . CTO
> Scout Labs
> 415.366.0450
> www.scoutlabs.com
>
