flink-issues mailing list archives

From "Gary Yao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-12384) Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"
Date Tue, 14 May 2019 08:05:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839189#comment-16839189 ]

Gary Yao commented on FLINK-12384:
----------------------------------

[~haf] ping

> Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-12384
>                 URL: https://issues.apache.org/jira/browse/FLINK-12384
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Henrik
>            Priority: Major
>
> {code:java}
> [tm] 2019-05-01 13:30:53,316 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper 
- Initiating client connection, connectString=analytics-zetcd:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f
> [tm] 2019-05-01 13:30:53,384 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn 
- SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration
section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-3674237213070587877.conf'.
Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server
allows it.
> [tm] 2019-05-01 13:30:53,395 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn 
- Opening socket connection to server analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181
> [tm] 2019-05-01 13:30:53,395 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner      
- Using configured hostname/address for TaskManager: 10.1.2.173.
> [tm] 2019-05-01 13:30:53,401 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState 
- Authentication failed
> [tm] 2019-05-01 13:30:53,418 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils        
- Trying to start actor system at 10.1.2.173:0
> [tm] 2019-05-01 13:30:53,420 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn 
- Socket connection established to analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181,
initiating session
> [tm] 2019-05-01 13:30:53,500 WARN  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket 
- Connected to an old server; r-o mode will be unavailable
> [tm] 2019-05-01 13:30:53,500 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn 
- Session establishment complete on server analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181,
sessionid = 0xbf06a739001d446, negotiated timeout = 60000
> [tm] 2019-05-01 13:30:53,525 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager 
- State change: CONNECTED{code}
> Repro:
> Start an etcd cluster with three members, e.g. with etcd-operator. Start zetcd in front of it.
Configure the Flink session cluster to connect to zetcd.
> Ensure the job can start successfully.
> Now, kill the etcd pods one by one, letting the quorum re-establish in between, so that
the cluster is still OK.
> Now restart the job/tm pods. You'll end up in the no-man's-land shown in the log above.
>  
> ---
> Workaround: clean out the etcd cluster and remove all its data. However, this resets
all time windows and state, despite having that saved in GCS, so it's a crappy workaround.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
