helix-commits mailing list archives

From "Lei Xia (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HELIX-608) NPE and unable to reconnect to zookeeper after a network outage
Date Thu, 27 Oct 2016 22:19:58 GMT

    [ https://issues.apache.org/jira/browse/HELIX-608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613447#comment-15613447 ]

Lei Xia commented on HELIX-608:
-------------------------------

There is a bug in the zkclient library we are using. In ZkClient.java, _connection and
_connection.getZookeeper() never return null until the client is explicitly closed, and once
it is closed, a flag (_closed) is set. This flag is checked in retryUntilConnected() before
the callback is invoked. For this reason, neither Helix's extended ZkClient nor the original
ZkClient checks for null in its various retryable operations.

protected boolean exists(final String path, final boolean watch) {
    ......
    try {
      return retryUntilConnected(new Callable<Boolean>() {
        @Override
        public Boolean call() throws Exception {
          return _connection.exists(path, watch);
        }
      });
     .....
    }

public <T> T retryUntilConnected(Callable<T> callable) throws ZkInterruptedException,
IllegalArgumentException, ZkException, RuntimeException {
        .....
        while (true) {
            if (_closed) {
                throw new IllegalStateException("ZkClient already closed!");
            }
            try {
                return callable.call();
            } catch (ConnectionLossException e) {
               ...
                waitForRetry();
            } catch (SessionExpiredException e) {
               ....
                waitForRetry();
            } catch (KeeperException e) {
                throw ZkException.create(e);
            } catch (InterruptedException e) {
                throw new ZkInterruptedException(e);
            } catch (Exception e) {
                throw ExceptionUtil.convertToRuntimeException(e);
            }
            .....
        }
    }

However, there is a bug in reconnect(), which closes _connection and then reconnects it.
It does not set the _closed flag after closing the connection, so if the reconnect fails,
reconnect() exits with the ZooKeeper handle behind _connection null and _closed not set.
We then see an NPE if there are still pending reads/writes to retry.

private void reconnect() {
    getEventLock().lock();
    try {
        _connection.close();
        _connection.connect(this);
    } catch (InterruptedException e) {
        throw new ZkInterruptedException(e);
    } finally {
        getEventLock().unlock();
    }
}
https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkClient.java
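
The failure mode described above can be reproduced in isolation. The following is only a toy
model, not the zkclient code: ToyConnection, ToyClient and ReconnectNpeDemo are hypothetical
names, locking and retry waits are left out, and the connect failure is simulated with a flag
instead of a real network outage. It assumes that closing the connection drops the underlying
handle, which is what the NPE inside ZkConnection.exists() in the quoted trace suggests.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for ZkConnection: close() drops the handle, connect() may fail.
class ToyConnection {
    private Object handle = new Object(); // stands in for the ZooKeeper handle
    private final boolean connectFails;

    ToyConnection(boolean connectFails) {
        this.connectFails = connectFails;
    }

    void close() {
        handle = null;
    }

    void connect() {
        if (connectFails) {
            throw new IllegalStateException("Unable to connect");
        }
        handle = new Object();
    }

    boolean exists(String path) {
        handle.getClass(); // no null check: throws NullPointerException if handle is null
        return true;
    }
}

// Hypothetical stand-in for ZkClient, reduced to the parts discussed in the comment.
class ToyClient {
    private final ToyConnection connection;
    private volatile boolean closed = false;

    ToyClient(ToyConnection connection) {
        this.connection = connection;
    }

    void reconnect() {
        connection.close();
        connection.connect(); // if this throws, 'closed' is still false
    }

    <T> T retryUntilConnected(Callable<T> callable) throws Exception {
        if (closed) {
            throw new IllegalStateException("ZkClient already closed!");
        }
        return callable.call();
    }

    boolean exists(final String path) throws Exception {
        Callable<Boolean> call = () -> connection.exists(path);
        return retryUntilConnected(call);
    }
}

class ReconnectNpeDemo {
    // Returns the simple name of the exception seen by a read issued after a failed reconnect.
    static String run() {
        ToyClient client = new ToyClient(new ToyConnection(true));
        try {
            client.reconnect();
        } catch (IllegalStateException e) {
            // The reconnect failed, as during the outage; the client is left half-closed.
        }
        try {
            client.exists("/some/path");
            return "no exception";
        } catch (Exception e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this model the closed-flag check passes because nothing ever set the flag, so the pending
read goes straight to the connection and dereferences the missing handle.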

  The right fix is in reconnect(); however, since it is a private method, Helix cannot
override it. This NPE happens only when the client fails to reconnect to the ZooKeeper
server, which should be rare given that ZooKeeper is supposed to be highly available.
Moreover, once it happens, even if Helix checks for null, we can do nothing more than throw
a different exception. Instead, I will open a ticket with the zkclient open source community
to convince them to fix the problem.
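
As a rough illustration of what a fix inside reconnect() could look like, here is a minimal
sketch on a toy model. It is one possible reading of the comment, not the upstream fix:
FlakyConnection, GuardedClient and ReconnectGuardDemo are hypothetical names, locking is
omitted, and treating a failed reconnect like an explicit close is an assumption. A real fix
might prefer to keep retrying the connection instead.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for ZkConnection: close() drops the handle, connect() may fail.
class FlakyConnection {
    private Object handle = new Object(); // stands in for the ZooKeeper handle
    private final boolean connectFails;

    FlakyConnection(boolean connectFails) {
        this.connectFails = connectFails;
    }

    void close() {
        handle = null;
    }

    void connect() {
        if (connectFails) {
            throw new IllegalStateException("Unable to connect");
        }
        handle = new Object();
    }

    boolean exists(String path) {
        handle.getClass(); // throws NullPointerException if handle is null
        return true;
    }
}

// Hypothetical client whose reconnect() records a failed reconnect in the closed flag.
class GuardedClient {
    private final FlakyConnection connection;
    private volatile boolean closed = false;

    GuardedClient(FlakyConnection connection) {
        this.connection = connection;
    }

    void reconnect() {
        connection.close();
        try {
            connection.connect();
        } catch (RuntimeException e) {
            closed = true; // assumption: a failed reconnect is treated like an explicit close
            throw e;
        }
    }

    <T> T retryUntilConnected(Callable<T> callable) throws Exception {
        if (closed) {
            throw new IllegalStateException("ZkClient already closed!");
        }
        return callable.call();
    }

    boolean exists(final String path) throws Exception {
        Callable<Boolean> call = () -> connection.exists(path);
        return retryUntilConnected(call);
    }
}

class ReconnectGuardDemo {
    // Returns the simple name of the exception seen by a read issued after a failed reconnect.
    static String run() {
        GuardedClient client = new GuardedClient(new FlakyConnection(true));
        try {
            client.reconnect();
        } catch (IllegalStateException e) {
            // The reconnect failed; the client has now marked itself closed.
        }
        try {
            client.exists("/some/path");
            return "no exception";
        } catch (Exception e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With the flag set, pending operations hit the existing closed check and fail with an
IllegalStateException rather than an NPE. The caller still cannot proceed, but the failure
names its cause.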




> NPE and unable to reconnect to zookeeper after a network outage
> ---------------------------------------------------------------
>
>                 Key: HELIX-608
>                 URL: https://issues.apache.org/jira/browse/HELIX-608
>             Project: Apache Helix
>          Issue Type: Bug
>          Components: helix-core
>    Affects Versions: 0.7.1
>            Reporter: Changgeng Li
>            Assignee: Lei Xia
>
> I noticed one of the participants is not a live instance in zookeeper after a network
> outage, while the java process is still alive. I have to restart the java process to make
> it live again.
> Found the following logs:
> ERROR 2015-07-28 17:12:15,010 [main-EventThread] org.apache.zookeeper.ClientCnxn: Error while calling watcher
> java.lang.RuntimeException: Exception while restarting zk client
>         at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:462) ~[zaaa.jar:?]
>         at org.I0Itec.zkclient.ZkClient.process(ZkClient.java:368) ~[zaaa.jar:?]
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531) [zaaa.jar:?]
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507) [zaaa.jar:?]
> Caused by: org.I0Itec.zkclient.exception.ZkException: Unable to connect to zzookeeperhost:2181,zookeeperhost2.com:2181/a
>         at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:66) ~[zaaa.jar:?]
>         at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:935) ~[zaaa.jar:?]
>         at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:458) ~[zaaa.jar:?]
>         ... 3 more
> Caused by: java.net.UnknownHostException: zzookeeperhost: Temporary failure in name resolution
>         at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) ~[?:1.7.0_72]
>         at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901) ~[?:1.7.0_72]
>         at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293) ~[?:1.7.0_72]
>         at java.net.InetAddress.getAllByName0(InetAddress.java:1246) ~[?:1.7.0_72]
>         at java.net.InetAddress.getAllByName(InetAddress.java:1162) ~[?:1.7.0_72]
>         at java.net.InetAddress.getAllByName(InetAddress.java:1098) ~[?:1.7.0_72]
>         at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:387) ~[zaaa.jar:?]
>         at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:332) ~[zaaa.jar:?]
>         at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:383) ~[zaaa.jar:?]
>         at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:64) ~[zaaa.jar:?]
>         at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:935) ~[zaaa.jar:?]
>         at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:458) ~[zaaa.jar:?]
>         ... 3 more
> INFO  2015-07-28 17:12:15,010 [main-EventThread] org.apache.zookeeper.ClientCnxn: EventThread shut down
> ERROR 2015-07-28 17:12:15,014 [ZkClient-EventThread-184-zzookeeperhost:2181,zookeeperhost2.com:2181/a] org.I0Itec.zkclient.ZkEventThread: Error handling event ZkEvent[Children of /zaaa/INSTANCES/10.211.12.21_9000/MESSAGES changed sent to org.apache.helix.manager.zk.ZkCallbackHandler@71bd5cfa]
> java.lang.NullPointerException
>         at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95) ~[zaaa.jar:?]
>         at org.apache.helix.manager.zk.ZkClient$2.call(ZkClient.java:195) ~[zaaa.jar:?]
>         at org.apache.helix.manager.zk.ZkClient$2.call(ZkClient.java:192) ~[zaaa.jar:?]
>         at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675) ~[zaaa.jar:?]
>         at org.apache.helix.manager.zk.ZkClient.exists(ZkClient.java:192) ~[zaaa.jar:?]
>         at org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:445) ~[zaaa.jar:?]
>         at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:566) ~[zaaa.jar:?]
>         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) [zaaa.jar:?]
> ERROR 2015-07-28 17:12:15,015 [ZkClient-EventThread-184-zzookeeperhost:2181,zookeeperhost2.com:2181/a] org.I0Itec.zkclient.ZkEventThread: Error handling event ZkEvent[Children of /zaaa/EXTERNALVIEW changed sent to org.apache.helix.manager.zk.ZkCallbackHandler@35d1655]
> java.lang.NullPointerException
>         at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95) ~[zaaa.jar:?]
>         at org.apache.helix.manager.zk.ZkClient$2.call(ZkClient.java:195) ~[zaaa.jar:?]
>         at org.apache.helix.manager.zk.ZkClient$2.call(ZkClient.java:192) ~[zaaa.jar:?]
>         at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675) ~[zaaa.jar:?]
>         at org.apache.helix.manager.zk.ZkClient.exists(ZkClient.java:192) ~[zaaa.jar:?]
>         at org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:445) ~[zaaa.jar:?]
>         at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:566) ~[zaaa.jar:?]
>         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) [zaaa.jar:?]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
