hbase-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Hbase Master Failover Issue
Date Sat, 14 May 2011 01:17:43 GMT
Ok i think the issue is largely solved. Thanks for your help, guys.

-d
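
For anyone hitting the same multi-NIC symptom described in the quoted message below, here is a minimal sketch (not taken from this thread) of checking which interface the host's canonical name lands on. The interface names and subnets are hypothetical placeholders, and HBase does its own lookup in Java, so treat this only as a quick sanity check of what the OS resolver returns.

```python
import ipaddress
import socket


def interface_for_address(address, interfaces):
    """Return the name of the interface whose subnet contains `address`, or None.

    `interfaces` maps an interface name to its subnet in CIDR notation.
    """
    ip = ipaddress.ip_address(address)
    for name, cidr in interfaces.items():
        if ip in ipaddress.ip_network(cidr, strict=False):
            return name
    return None


def canonical_name_and_address():
    """Return (fully qualified hostname, address it resolves to) for this host."""
    fqdn = socket.getfqdn()
    return fqdn, socket.gethostbyname(fqdn)


if __name__ == "__main__":
    # Hypothetical layout: replace with the subnets of your own NICs.
    interfaces = {"eth0": "10.0.0.0/24", "eth2": "192.168.10.0/24"}
    fqdn, addr = canonical_name_and_address()
    print(fqdn, addr, interface_for_address(addr, interfaces))
```

If the interface printed is not the one the region servers are supposed to reach the master on, the canonical name or the hosts file entries are the likely place to look.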

On Fri, May 13, 2011 at 5:32 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> ok the problem seems to be multi-nic hosting on masters. the hbase
> master starts up and listens on the canonical hostname, which points
> to the wrong nic. I am not sure why, so i am not changing this, but i am
> struggling to override it at the moment as nothing seems to work
> (master.dns.interface=eth2, master.dns.server=ip2 ... tried all
> possible combinations). it probably has something to do with reverse
> lookup, so i added an entry to the hosts files, to no avail so far. i will
> have to talk to our admins to see why we can't switch the canonical host
> name to the ip that all the nodes are supposed to use.
>
> thanks.
> -d
>
> On Fri, May 13, 2011 at 3:39 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> Thanks, Jean-Daniel.
>>
>> Logs don't show anything abnormal (not even warnings). How soon do you
>> think the region servers should join?
>>
>> I am guessing the sequence should be something along these lines:
>> zookeeper needs to time out the old master session first (2 mins or so),
>> then the hot spare should win the next master election (we probably should
>> see that happening if we tail its log, right?),
>> and then the rest of the crowd should join at an interval that
>> seems to be governed by the hbase.regionserver.msginterval property, if i
>> read the code correctly?
>>
>> So all in all, something like 3 minutes should guarantee that
>> everybody has found the new master one way or another, right? if not,
>> we have a problem, right?
>>
>> Thanks.
>> -Dmitriy
>>
>> On Fri, May 13, 2011 at 12:34 PM, Jean-Daniel Cryans
>> <jdcryans@apache.org> wrote:
>>> Maybe there is something else in there, would be useful to see logs
>>> from the region servers when you are shutting down master 1 and
>>> bringing up master2.
>>>
>>> About "I have no failover for a critical component of my
>>> infrastructure.", so is the Namenode, and for the moment you can't do
>>> much about it. What's usually recommended is to put both the master
>>> and the NN together on a more reliable machine. And the master ain't
>>> that critical, almost everything works without it.
>>>
>>> J-D
>>>
>>> On Fri, May 13, 2011 at 12:08 PM, sean barden <sbarden@gmail.com> wrote:
>>>> So I updated one of my clusters from CDHb1 to u0 with no issues (in the
>>>> upgrade).  Hbase failed over to its "backup" master server just fine
>>>> in the older version.  As of 0.90.1+15.18, I had hoped the fix would be
>>>> in u0 for the failover issue.  However, I'm having the same issue.
>>>> master1 fails or I shut it down, and master2 waits for RSes to check in
>>>> forever.  Restarting the services for master2 and all RSes does
>>>> nothing until I start up master1.  So, essentially, I have no failover
>>>> for a critical component of my infrastructure.  Needless to say I'm
>>>> exceptionally frustrated.  Any ideas to a fix or workaround would be
>>>> greatly appreciated.
>>>>
>>>> Regards,
>>>>
>>>> Sean
>>>>
>>>> On Thu, May 5, 2011 at 11:59 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>> Upgrade to CDH3u0 which as far as I can tell has it:
>>>>> http://archive.cloudera.com/cdh/3/hbase-0.90.1+15.18.CHANGES.txt
>>>>>
>>>>> J-D
>>>>>
>>>>> On Thu, May 5, 2011 at 9:55 AM, sean barden <sbarden@gmail.com> wrote:
>>>>>> Looks like my issue.  We're using 0.90.1-CDH3B4 .  Looks like an
>>>>>> upgrade is in order.  Can you suggest a workaround?
>>>>>>
>>>>>> thx,
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>> On Thu, May 5, 2011 at 11:49 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>>>> This sounds like https://issues.apache.org/jira/browse/HBASE-3545
>>>>>>> which was fixed in 0.90.2. Which version are you testing?
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Thu, May 5, 2011 at 9:23 AM, sean barden <sbarden@gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm testing failing over from one master to another by stopping
>>>>>>>> master1 (master2 is always running).  Master2 web i/f kicks in and I can
>>>>>>>> zk_dump but the region servers never show up.  Logs on master2 show
>>>>>>>> repeated entries below:
>>>>>>>>
>>>>>>>> 2011-05-05 09:10:05,938 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>>>>>> Waiting on regionserver(s) to checkin
>>>>>>>> 2011-05-05 09:10:07,440 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>>>>>> Waiting on regionserver(s) to checkin
>>>>>>>>
>>>>>>>> Obviously the RS are not checking in.  Not sure why.
>>>>>>>>
>>>>>>>> Any ideas?
>>>>>>>>
>>>>>>>> thx,
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sean Barden
>>>>>>>> sbarden@gmail.com
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sean Barden
>>>>>> sbarden@gmail.com
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Sean Barden
>>>> sbarden@gmail.com
>>>>
>>>
>>
>
