hbase-user mailing list archives

From Dan Brodsky <danbrod...@gmail.com>
Subject Re: Regionservers not connecting to master
Date Wed, 17 Oct 2012 17:29:54 GMT

Thanks for your suggestions.

The datanodes are all built using the same image, so I know they're
all pointed to the same ZK nodes.

I monitored all three ZK logs, the master log, and the regionserver
log for each RS I was trying to bring back online. I'm glad I have a
big screen. :-) Here is what I found:

Whenever a regionserver connects to one particular ZK peer *first*, it
never goes online. The ZK log shows a successful connection
negotiating a timeout value, and the RS's log shows a successful ZK
connection, but then it just sits there.

When a regionserver starts up and connects to one of the other two ZK
peers first, it connects to a second one successfully, then contacts
the master, and it comes up and all is happy.

So the problem of regionservers not connecting to master only happens
when the RS tries one particular ZK node as its first ZK connection.
But the logs aren't helpful for diagnosing further than that.
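One way to compare the suspect ZK peer against the other two, independent of HBase, is to send each peer ZooKeeper's four-letter commands ("ruok" and "srvr") over its client port. The sketch below is only an illustration under assumptions: the function names are made up for this example, the hosts and ports you pass in are your own, and on newer ZooKeeper releases some four-letter commands must be whitelisted before they answer. A peer that accepts connections but is not part of a working quorum would typically show up here as a missing "Mode:" line or a non-"imok" reply.

```python
import socket


def four_letter_word(host, port, cmd, timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        # ZooKeeper closes the connection after replying, so read until EOF.
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def peer_mode(srvr_reply):
    """Return the value of the 'Mode:' line in a 'srvr' reply, or None."""
    for line in srvr_reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def check_peers(peers):
    """Probe each (host, port) pair; return {(host, port): (ruok_reply, mode)}."""
    results = {}
    for host, port in peers:
        try:
            ruok = four_letter_word(host, port, "ruok").strip()
            mode = peer_mode(four_letter_word(host, port, "srvr"))
        except OSError as exc:
            # Connection refused, timeout, DNS failure, and similar.
            ruok, mode = "error: %s" % exc, None
        results[(host, port)] = (ruok, mode)
    return results
```

Calling check_peers() with the three quorum peers (for example on the default client port 2181, if that is what your cluster uses) and printing the result puts the three answers side by side. If the peer that regionservers hang on reports no mode while the other two report leader and follower, that would point at the ZK ensemble rather than at HBase.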

Additional thoughts?

On Wed, Oct 17, 2012 at 9:12 AM, Ramkrishna.S.Vasudevan
<ramkrishna.vasudevan@huawei.com> wrote:
> Can you try starting any of the regionservers that are not connecting at
> all? Maybe start 2 of them.
> Observe the master logs. See whether it says
> 'Waiting for RegionServers to checkin'.
> Just to confirm, your ZK IP and port are correct throughout the cluster? If
> it is a multitenant cluster, then maybe the other regionservers are connecting
> to some other ZK cluster?
> Wild guess :)
> Regards
> Ram
>> -----Original Message-----
>> From: Dan Brodsky [mailto:danbrodsky@gmail.com]
>> Sent: Wednesday, October 17, 2012 6:31 PM
>> To: user@hbase.apache.org
>> Subject: Regionservers not connecting to master
>> Good morning,
>> I have a 10 node Hadoop/Hbase cluster, plus a namenode VM, plus three
>> Zookeeper quorum peers (one on the namenode, one on a dedicated ZK
>> peer VM, and one on a third box). All 10 HDFS datanodes are also Hbase
>> regionservers.
>> Several weeks ago, we had six HDFS datanodes go offline suddenly (with
>> no meaningful error messages), and since then, I have been unable to
>> get all 10 regionservers to connect to the Hbase master. I've tried
>> bringing the cluster down and rebooting all the boxes, but no joy. The
>> machines are all running, and hbase-regionserver appears to start
>> normally on each one.
>> Right now, my master status page (http://namenode:60010) shows 3
>> regionservers online. There are also dozens of regions in transition
>> listed on the status page (in the PENDING_OPEN state), but each of
>> those is on one of the regionservers already online.
>> The 7 other regionservers' log files show a successful connection to
>> one ZK peer, followed by a regular trail of these messages:
>> 2012-10-17 12:36:08,394 DEBUG
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=8.17
>> MB, free=987.67 MB, max=995.84 MB, blocks=0, accesses=0, hits=0,
>> hitRatio=0cachingAccesses=0, cachingHits=0,
>> cachingHitsRatio=0evictions=0, evicted=0, evictedPerRun=NaN
>> If I had to wager a guess, it seems like the 7 offline regionservers
>> are not connecting to other ZK peers, but there isn't anything in the
>> ZK logs to indicate why.
>> Thoughts?
>> Dan
