hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: Not running balancer because processing dead regionserver(s)
Date Tue, 22 Feb 2011 22:25:39 GMT
On Mon, Feb 21, 2011 at 10:04 PM, Yi Liang <whitesky@gmail.com> wrote:
> Yes, the server zcl crashed at that time.
>
> But after I restarted it later, it's still in the dead server list.
>

We failed processing its death:

2011-02-18 10:08:14,873 ERROR org.apache.hadoop.hbase.HServerAddress:
Could not resolve the DNS name of zcl.local:60020
2011-02-18 10:08:14,874 ERROR
org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while
processing event M_SERVER_SHUTDOWN
java.lang.IllegalArgumentException: Could not resolve the DNS name of
zcl.local:60020
        at org.apache.hadoop.hbase.HServerAddress.checkBindAddressCanBeResolved(HServerAddress.java:105)
        at org.apache.hadoop.hbase.HServerAddress.<init>(HServerAddress.java:66)
        at org.apache.hadoop.hbase.catalog.MetaReader.metaRowToRegionPairWithInfo(MetaReader.java:407)
        at org.apache.hadoop.hbase.catalog.MetaReader.getServerUserRegions(MetaReader.java:594)
        at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:124)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

It looks like the above exception made us bail out of processing the
server shutdown.  It is related to the 'No route to host' exception
that Ted pointed out.

I filed HBASE-3556.  It'll be 'fixed' by HBASE-1501 but we should
never just give up processing.  Need to look into that.
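
To illustrate "never just give up processing", here is a minimal sketch.
It is not HBase code and not the HBASE-3556 patch; the class, the method
names, and the pluggable resolver are all hypothetical. The idea is only
that a failure on one row is recorded and skipped, so the remaining rows
still get processed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch, not taken from the HBase source tree.
class TolerantShutdownSketch {

    /** Rows that were handled and rows that were skipped after an error. */
    static final class Result {
        final List<String> processed = new ArrayList<>();
        final List<String> skipped = new ArrayList<>();
    }

    /**
     * Walks every host:port entry. The resolver stands in for whatever
     * address check is done per row; it is assumed to throw
     * IllegalArgumentException when a name cannot be resolved.
     */
    static Result processRows(List<String> hostPorts,
                              Function<String, String> resolver) {
        Result result = new Result();
        for (String hostPort : hostPorts) {
            try {
                resolver.apply(hostPort);
                result.processed.add(hostPort);
            } catch (IllegalArgumentException e) {
                // Record the failure and carry on with the next row
                // instead of letting the exception end the whole pass.
                result.skipped.add(hostPort);
            }
        }
        return result;
    }
}
```

With a resolver that fails only for zcl.local:60020, the other rows are
still processed and the failing one ends up in the skipped list.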

While a server is in the dead servers list, we'll not run the
balancer.  The dead servers list is an in-memory list.  You'd need to
kill the master and bring it back up again to get rid of the dead
server state.
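
A rough sketch of that gating, under the assumption that the list is a
plain in-memory set owned by the master process. The class and method
names below are made up for illustration and are not the actual HMaster
or ServerManager API:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch, not taken from the HBase source tree.
class BalancerGateSketch {

    // Lives only in this object's memory, so a fresh instance (think:
    // a restarted master) starts with an empty set.
    private final Set<String> deadServers = new LinkedHashSet<>();

    /** Called when a region server is declared dead. */
    void expire(String serverName) {
        deadServers.add(serverName);
    }

    /** Called only if shutdown processing for the server completes. */
    void finishProcessing(String serverName) {
        deadServers.remove(serverName);
    }

    /** Returns true if the balancer ran, false if it was skipped. */
    boolean balance() {
        if (!deadServers.isEmpty()) {
            System.out.println("Not running balancer because processing "
                + "dead regionserver(s): " + deadServers);
            return false;
        }
        return true;
    }
}
```

If the shutdown handler dies on an exception, finishProcessing is never
reached for that server, so balance() keeps returning false until the
object is recreated.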

St.Ack


> 2011-02-18 10:39:26,895 INFO org.apache.hadoop.hbase.master.ServerManager:
> Registering server=zcl.local,60020,1297996817352, regionCount=0,
> userLoad=false
> 2011-02-18 10:39:35,062 DEBUG org.apache.hadoop.hbase.master.HMaster: Not
> running balancer because processing dead regionserver(s):
> [Docete.local,60020,1297919410096, liym.local,60020,1297919445796,
> zcl.local,60020,1297919367472]
>
> On Tue, Feb 22, 2011 at 1:48 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>
>> Looks like there was connectivity issue:
>>
>> java.net.NoRouteToHostException: No route to host
>>
>> On Sun, Feb 20, 2011 at 10:09 PM, Yi Liang <whitesky@gmail.com> wrote:
>>
>> > The related log is at: http://pastebin.com/0a1CjDUD
>> >
>> > It's ok now after restarting hbase, but still curious why it happened.
>> >
>> > Thanks,
>> > Yi
>> > On Sat, Feb 19, 2011 at 3:58 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>> >
>> > > The master should finish processing those dead servers at some point
>> > > and it seems it's not happening? Unfortunately without the log nobody
>> > > can tell why. If you can post the complete log in pastebin or put it
>> > > on a web server then we could take a look.
>> > >
>> > > J-D
>> > >
>> > > On Fri, Feb 18, 2011 at 12:39 AM, Yi Liang <whitesky@gmail.com> wrote:
>> > > > Hi all,
>> > > >
>> > > > We have an HBase cluster with 10 region servers running HBase 0.90.0 +
>> > > > CDH3.
>> > > > We're now importing big data into HBase.
>> > > >
>> > > > During the process, 2 servers crashed, but after restarting them, they're
>> > > > no longer assigned any regions, while regions on other servers keep
>> > > > splitting as more data is inserted.
>> > > >
>> > > > In the master log, we see periodic messages like:
>> > > >
>> > > > 2011-02-18 16:09:35,067 DEBUG org.apache.hadoop.hbase.master.HMaster: Not
>> > > > running balancer because processing dead regionserver(s):
>> > > > [zcl.local,60020,1297996817352, qics.local,60020,1297919358488,
>> > > > Docete.local,60020,1297919410096, liym.local,60020,1297919445796,
>> > > > zcl.local,60020,1297919367472]
>> > > >
>> > > > zcl.local and qics.local are the machines we have restarted; the other 2
>> > > > machines have kept running without restarting and are actually still
>> > > > serving regions.
>> > > >
>> > > > From the shell status:
>> > > > 10 servers, 5 dead, 10.1000 average Load
>> > > >
>> > > > Why are there dead servers? And how do we clear them so we can start
>> > > > the balancer?
>> > > >
>> > > > Thanks,
>> > > > Yi
>> > > >
>> > >
>> >
>>
>
