hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aaron Crow <dirtyvagab...@yahoo.com>
Subject Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU
Date Thu, 29 Apr 2010 01:29:23 GMT
We were running version 3.2.2 for about a month and it was working well for
us. Then late this past Saturday night, our cluster went pathological. One
of the 3 ZK servers spewed many WARNs (see below), and the other 2 servers
were almost constantly pegging the CPU. All three servers are on separate
machines. From what we could tell, the machines were fine... networking
fine, disk fine, etc. The ZK clients were completely unable to complete
their connections to ZK.

We tried all sorts of restarts, running zkCleanup, etc. We even completely
shut down our clients... and the pathology continued. Our workaround was to
do an urgent upgrade to version 3.3.0. The new ZK cluster with 3.3.0 has
been running well for us... so far...

I realize that, sadly, this message doesn't contain nearly enough details to
trace exactly what happened. I guess I'm wondering if anyone has seen this
general scenario, and/or knows how to prevent? Is there anything we might be
doing client side to trigger this? Our application level request frequency
is maybe a few requests to Zookeeper per second, times 5 clients
applications. If we detect a SESSION EXPIRED, we do a simple "create new
client and use that instead". And we were seeing this happen occasionally.

Many, many thanks in advance for any insights or advice.


Example log output from the ZK server that spewed many WARNs:

===

2010-04-24 22:31:24,551 - WARN
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@318] - Exception when following
the leader
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:66)
at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
at org.apache.zookeeper.server.quorum.Follower.readPacket(Follower.java:114)
at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:193)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:525)
2010-04-24 22:31:24,552 - INFO
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@436] - shutdown called
java.lang.Exception: shutdown Follower
at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:436)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:529)
2010-04-24 22:31:24,552 - INFO
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@514] - LOOKING

...

2010-04-24 22:31:27,257 - WARN  [NIOServerCxn.Factory:2181:NIOServerCnxn@518]
- Exception causing close of session 0x0 due to java.io.IOException:
ZooKeeperServer not running
2010-04-24 22:31:27,257 - INFO  [NIOServerCxn.Factory:2181:NIOServerCnxn@857]
- closing session:0x0 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/127.0.0.1:2181remote=/
127.0.0.1:44589]
2010-04-24 22:31:27,705 - INFO
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@706] - Notification:
3, 60129542145, 2, 1, LOOKING, LEADING, 3
2010-04-24 22:31:27,705 - INFO
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@523] - FOLLOWING
2010-04-24 22:31:27,705 - INFO
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@160] - Created server
2010-04-24 22:31:27,706 - INFO
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@147] - Following sum08/
10.0.10.118:2888
2010-04-24 22:31:29,040 - WARN  [NIOServerCxn.Factory:2181:NIOServerCnxn@518]
- Exception causing close of session 0x0 due to java.io.IOException:
ZooKeeperServer not running
2010-04-24 22:31:29,040 - INFO  [NIOServerCxn.Factory:2181:NIOServerCnxn@857]
- closing session:0x0 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/127.0.0.1:2181remote=/
127.0.0.1:44591]
2010-04-24 22:31:31,069 - WARN  [NIOServerCxn.Factory:2181:NIOServerCnxn@518]
- Exception causing close of session 0x0 due to java.io.IOException:
ZooKeeperServer not running
2010-04-24 22:31:31,069 - INFO  [NIOServerCxn.Factory:2181:NIOServerCnxn@857]
- closing session:0x0 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/127.0.0.1:2181remote=/
127.0.0.1:44592]
2010-04-24 22:31:32,589 - WARN  [NIOServerCxn.Factory:2181:NIOServerCnxn@518]
- Exception causing close of session 0x0 due to java.io.IOException:
ZooKeeperServer not running

===

Example log output from a server that was pegging CPU:

010-04-24 23:05:27,911 - INFO  [WorkerReceiver
Thread:FastLeaderElection$Messenger$WorkerReceiver@254] - Sending new
notification.
2010-04-24 23:05:27,912 - INFO  [WorkerReceiver
Thread:FastLeaderElection$Messenger$WorkerReceiver@254] - Sending new
notification.
2010-04-24 23:05:27,910 - WARN  [Thread-15:QuorumCnxManager$SendWorker@539]
- Interrupted while waiting for message on queue
java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1934)
        at
java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:317)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:533)
2010-04-24 23:05:27,912 - INFO  [WorkerReceiver
Thread:FastLeaderElection$Messenger$WorkerReceiver@254] - Sending new
notification.
2010-04-24 23:05:27,912 - WARN  [Thread-16:QuorumCnxManager$RecvWorker@623]
- Connection broken:
java.nio.channels.AsynchronousCloseException
        at
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:185)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:263)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:594)
2010-04-24 23:05:27,912 - WARN  [Thread-15:QuorumCnxManager$SendWorker@554]
- Send worker leaving thread

...

010-04-24 23:05:51,952 - INFO  [WorkerReceiver
Thread:FastLeaderElection$Messenger$WorkerReceiver@254] - Sending new
notification.
2010-04-24 23:05:57,148 - INFO  [WorkerReceiver
Thread:FastLeaderElection$Messenger$WorkerReceiver@254] - Sending new
notification.
2010-04-24 23:06:03,243 - INFO  [FollowerHandler-/10.0.10.116:34405
:FollowerHandler@227] - Follower sid: 1 : info :
org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@364641
2010-04-24 23:06:03,244 - WARN  [FollowerHandler-/10.0.10.116:34405
:FollowerHandler@302] - Sending snapshot last zxid of peer is 0xd0007d66d
zxid of leader is 0xf00000000
2010-04-24 23:06:03,254 - ERROR [FollowerHandler-/10.0.10.116:34405
:FollowerHandler@415] - Unexpected exception causing shutdown while sock
still open
java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
        at
org.apache.jute.BinaryOutputArchive.writeLong(BinaryOutputArchive.java:59)
        at org.apache.zookeeper.server.DataNode.serialize(DataNode.java:129)
        at
org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
        at
org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:917)
        at
org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:929)
        at
org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:929)
        at
org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:929)
        at
org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:929)
        at org.apache.zookeeper.server.DataTree.serialize(DataTree.java:979)
        at
org.apache.zookeeper.server.util.SerializeUtils.serializeSnapshot(SerializeUtils.java:104)
        at
org.apache.zookeeper.server.ZooKeeperServer.serializeSnapshot(ZooKeeperServer.java:272)
        at
org.apache.zookeeper.server.quorum.FollowerHandler.run(FollowerHandler.java:307)
2010-04-24 23:06:03,255 - WARN  [FollowerHandler-/10.0.10.116:34405
:FollowerHandler@421] - ******* GOODBYE /10.0.10.116:34405 ********
2010-04-24 23:06:03,550 - INFO  [WorkerReceiver
Thread:FastLeaderElection$Messenger$WorkerReceiver@254] - Sending new
notification.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message