hbase-user mailing list archives

From Jochen Frey <joc...@scoutlabs.com>
Subject Re: Flakey table disable/enable [WAS -> Re: Table disabled but all regions still online?]
Date Thu, 19 Nov 2009 01:32:20 GMT
For what it's worth (since I kind of started this thread) - it's not
important for me. Enabling and disabling tables is not something we'd do
programmatically.

J

On Wed, Nov 18, 2009 at 5:28 PM, Stack <saint.ack@gmail.com> wrote:

> I am torn.  I sort of want to just fix it right in 0.21 but your tm team
> writing such a test would indicate this is an important feature and maybe we
> should not wait?
>
>
>
>
> On Nov 18, 2009, at 11:13 AM, Andrew Purtell <apurtell@apache.org> wrote:
>
>> There's a team evaluating HBase in Trend that raised this very issue
>> today. This is the test as described:
>> "We execute the following step via Java API:
>>
>>     a. create many tables (about 1000 tables), each table have 10 columns
>>        and 20 rows (value length is 60-100 bytes)
>>     b. delete some tables (about 10 tables) of these existent tables
>>     c. create some new tables (about 10 tables), each table have 10 columns
>>        and 20 rows (value length is 60-100 bytes)
>>     d. repeat step b and step c
>>
>>    Execute these step about 6-10 hours, one of these tables will
>> not be able to disabled."
>> The test cluster is an 8 node setup. This is 0.20.2 RC1.
>>
>> They have a wedged table available for examination. I have not gone on yet
>> and looked around or tried anything like close_region etc. If you want to go
>> on to the cluster and have a look around, I can arrange that.
>>
>> My suggestion was to avoid using temporary tables in HBase like one might
>> use with a RDBMS -- create one or maybe just a few tables for containing
>> temporary values, use TTLs as appropriate, and prepend strings to keys for
>> example foo_key_1, bar_key_1, etc. such that it's equivalent to storing
>> key_1 in temp tables foo and bar.
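
The key-prefix idea above can be sketched in a few lines. This is only an illustration under stated assumptions: a plain Python dict stands in for the single shared table, and `prefixed_key` and `scan_namespace` are hypothetical helper names, not HBase API calls.

```python
# Emulate per-"temp table" namespaces inside one shared table by prefixing
# row keys (foo_key_1, bar_key_1). The dict is a stand-in for the table.

def prefixed_key(namespace: str, key: str) -> str:
    """Build the row key that stores `key` under the logical table `namespace`."""
    return f"{namespace}_{key}"


def scan_namespace(table: dict, namespace: str) -> dict:
    """Return the rows of one logical table, with the prefix stripped again."""
    prefix = namespace + "_"
    return {
        row_key[len(prefix):]: value
        for row_key, value in sorted(table.items())
        if row_key.startswith(prefix)
    }


shared = {}
shared[prefixed_key("foo", "key_1")] = "value stored for foo"
shared[prefixed_key("bar", "key_1")] = "value stored for bar"
```

One caveat with this naive version: a namespace that is itself a prefix of another (foo and foo_bar) would collide, so a real scheme would probably want a separator that cannot occur in namespace names.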
>>
>> I do think making enable/disable table less flaky in 0.20 is worth some
>> effort. I think few (if any) of us using HBase in production disable or
>> delete tables unless for some exceptional reason, but evaluators try it --
>> perhaps because they are used to creating and dropping temporary tables on
>> the RDBMS all the time -- and then become concerned.
>>
>>  - Andy
>>
>>
>>
>>
>> ________________________________
>> From: stack <stack@duboce.net>
>> To: hbase-user@hadoop.apache.org
>> Sent: Wed, November 18, 2009 10:41:16 AM
>> Subject: Flakey table disable/enable [WAS -> Re: Table disabled but all
>>  regions still online?]
>>
>> On Wed, Nov 18, 2009 at 8:10 AM, Jochen Frey <jochen@scoutlabs.com>
>> wrote:
>> ..
>>
>>
>>> However, at the same time all their regions are still online, which I can
>>> verify by way of the web interface as well as the command line interface
>>> (> 400 regions).
>>>
>>> This has happened at least twice by now. The first time I was able to
>>> "fix" it by restarting HDFS, the second time restarting didn't fix it.
>>>
>>>
>> In 0.20.x hbase, enable/disable of tables is unreliable as written.  It
>> will work when tables are small or we're in a unit test context where
>> configuration makes messaging more lively but it quickly turns flakey if
>> your table has any more than a few regions.
>>
>> Currently, the way it works is to message the master to run a processing
>> of all regions that make up a table.  The client waits under a timeout,
>> continually checking whether all regions are offline, but if the table is
>> large, the client will often time out before the master finishes.  The
>> master runs the process in a worker thread, in series, updating the .META.
>> table and flagging RegionServers one at a time that they need to close a
>> region on disable.  Closing a region entails flushing the memstore so it
>> can take a while.  Running in the master context is sort of necessary
>> because regions may be in a state of transition and the master is the place
>> where this state is kept, so it knows how to intercept region transitions
>> when asked to online/offline.
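
To make the timing problem described above concrete, here is a toy simulation, not HBase code. It assumes the master closes regions strictly one at a time at a fixed cost per region while the client polls until its timeout expires; `disable_table` and all of its parameters and numbers are hypothetical.

```python
def disable_table(num_regions, close_cost, client_timeout, poll_interval=1.0):
    """Simulate a serial disable with a polling client.

    The master is assumed to close one region every `close_cost` seconds.
    The client checks every `poll_interval` seconds and gives up once
    `client_timeout` seconds have passed.

    Returns (client_saw_table_disabled, regions_online_when_client_stopped).
    """
    now = 0.0
    while now <= client_timeout:
        closed = min(num_regions, int(now // close_cost))
        if closed == num_regions:
            return True, 0
        now += poll_interval
    # The client gave up; the master is still working through the regions.
    closed = min(num_regions, int(client_timeout // close_cost))
    return False, num_regions - closed


# A tiny table finishes within the timeout, a 400-region table does not.
small = disable_table(num_regions=3, close_cost=1.0, client_timeout=10.0)
large = disable_table(num_regions=400, close_cost=0.5, client_timeout=60.0)
```

With these made-up numbers the small table is reported disabled, while the client for the large table gives up with most regions still online, which matches the symptom reported at the start of the thread.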
>>
>> The master is being rewritten for hbase 0.21.  This is one area that is
>> being completely redone.  See
>> http://wiki.apache.org/hadoop/Hbase/MasterRewrite for the high-level
>> design sketch and then
>> https://issues.apache.org/jira/browse/HBASE-1730 "Near-instantaneous
>> online schema and table state updates" for explicit
>> discussion of how we're to do table state transitions.
>>
>> Enable/disable has been flakey for a while (See
>> https://issues.apache.org/jira/browse/HBASE-1636).  My understanding is
>> that it will work eventually if you keep trying (maybe this is wrong?)  so
>> I've always thought it down on the list of priorities and something we've
>> scheduled to fix properly in 0.21.  But you are the second fellow who has
>> raised enable/disable as a problem during an evaluation and I'm a little
>> concerned that flakey enable/disable is earning us a black mark.  If it's
>> important, I hope folks will flag it so.  In 0.20.x context, we could hack
>> up a script to run the table enable/disable in parallel.  It'd scan
>> .META., sort by servers, write close messages to each regionserver and
>> rewrite the table .META.  It could then just wait till all report disabled
>> perhaps resignalling if necessary.  If you just want to kill the table,
>> such a script may already exist for you.  See
>> https://issues.apache.org/jira/browse/HBASE-1872.
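
The shape of the parallel script described above could look roughly like the following. It is a sketch only: `meta_rows` is assumed to be a list of (region name, server) pairs as if scanned from .META., and `close_regions` is a hypothetical stand-in that makes no real RPC to a regionserver. Rewriting .META. and resignalling are left out.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_server(meta_rows):
    """Group region names by hosting server, sorted by server name."""
    by_server = defaultdict(list)
    for region, server in meta_rows:
        by_server[server].append(region)
    return dict(sorted(by_server.items()))


def close_regions(server, regions):
    """Hypothetical stand-in for sending close messages to one regionserver."""
    return [(server, region, "closed") for region in regions]


def parallel_disable(meta_rows):
    """Send the per-server close batches concurrently, one worker per server."""
    groups = group_by_server(meta_rows)
    with ThreadPoolExecutor(max_workers=max(1, len(groups))) as pool:
        batches = pool.map(lambda item: close_regions(*item), groups.items())
    return [entry for batch in batches for entry in batch]


rows = [("t,a", "rs1:60020"), ("t,b", "rs2:60020"), ("t,c", "rs1:60020")]
report = parallel_disable(rows)
```

Grouping by server first means each regionserver gets its whole batch in one go, instead of the master walking the regions one at a time.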
>>
>> Thanks,
>> St.Ack
>>
>>
>>
>>> The first time this happened, we had a lot going on (rolling restart of
>>> the hbase nodes), hdfs balancer running. The second time I found the
>>> following exception in the master log (below). Can anyone shed some light
>>> on this or tell me what additional information would be helpful for
>>> debugging?
>>>
>>> Thanks so much!
>>> Jochen
>>>
>>>
>>> 2009-11-17 20:59:12,751 INFO org.apache.hadoop.hbase.master.ServerManager: 8 region servers, 0 dead, average load 50.25
>>> 2009-11-17 20:59:13,611 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 10.10.0.177:60020, regionname: -ROOT-,,0, startKey: <>}
>>> 2009-11-17 20:59:13,611 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 10.10.0.189:60020, regionname: .META.,,1, startKey: <>}
>>> 2009-11-17 20:59:13,620 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 10.10.0.177:60020, regionname: -ROOT-,,0, startKey: <>} complete
>>> 2009-11-17 20:59:13,622 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan one META region: {server: 10.10.0.189:60020, regionname: .META.,,1, startKey: <>}
>>> java.net.ConnectException: Connection refused
>>>      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>      at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>>>      at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>>      at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>>>      at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
>>>      at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831)
>>>      at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712)
>>>      at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>>>      at $Proxy6.openScanner(Unknown Source)
>>>      at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
>>>      at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>>      at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>>      at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
>>>      at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>>> 2009-11-17 20:59:13,623 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>>> d
>>>
>>>
>>> --
>>> Jochen Frey . CTO
>>> Scout Labs
>>> 415.366.0450
>>> www.scoutlabs.com
>>>
>>>
>>
>>
>>


-- 
Jochen Frey . CTO
Scout Labs
415.366.0450
www.scoutlabs.com
