flink-issues mailing list archives

From "Xiaogang Shi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections
Date Fri, 12 Apr 2019 09:18:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16816118#comment-16816118 ]

Xiaogang Shi commented on FLINK-10052:

What's the status of the issue? [~Wosinsan]

{{SessionConnectionStateErrorPolicy}} was introduced in Curator 3.0, while Flink currently
uses Curator 2.12.

Since Curator 3.x has compatibility problems with ZooKeeper 3.x (see
[ZooKeeper Compatibility|https://curator.apache.org/zk-compatibility.html]), we should bump
our Curator version to 4.x in order to use {{SessionConnectionStateErrorPolicy}}.

What do you think? [~till.rohrmann]

> Tolerate temporarily suspended ZooKeeper connections
> ----------------------------------------------------
>                 Key: FLINK-10052
>                 URL: https://issues.apache.org/jira/browse/FLINK-10052
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.4.2, 1.5.2, 1.6.0
>            Reporter: Till Rohrmann
>            Assignee: Dominik Wosiński
>            Priority: Major
> This issue results from FLINK-10011, which uncovered a problem with Flink's HA recovery
> and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator recipe for leader
> election. The leader latch revokes leadership when the ZooKeeper connection is suspended.
> This can be premature if the system can reconnect to ZooKeeper before its session
> expires. The effect of the lost leadership is that all jobs are canceled and immediately
> restarted after leadership is regained.
> Instead of revoking leadership immediately upon a SUSPENDED ZooKeeper connection, it
> would be better to wait until the ZooKeeper connection is LOST. That way we would allow the
> system to reconnect without losing leadership. This could be achieved by using Curator's
> {{LeaderSelector}} instead of the {{LeaderLatch}}.
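
The difference between revoking leadership on SUSPENDED and revoking it only on LOST can be illustrated with a small stand-alone model. This is a sketch only: it does not use Curator's classes, and the names {{LeadershipModel}}, {{ErrorPolicy}}, {{REVOKE_ON_SUSPENDED}}, {{REVOKE_ON_LOST}} and {{keepsLeadership}} are hypothetical, chosen for illustration. Only the connection state names SUSPENDED and LOST are taken from the issue description.

```java
import java.util.Arrays;
import java.util.List;

/** Stand-alone illustration; none of these types are Curator or Flink classes. */
final class LeadershipModel {

    /** Connection states; SUSPENDED and LOST are the two discussed in the issue. */
    enum ConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    /** Hypothetical policy: decides which connection states revoke leadership. */
    interface ErrorPolicy {
        boolean isErrorState(ConnectionState state);
    }

    /** Behaviour described in the issue: a SUSPENDED connection already revokes leadership. */
    static final ErrorPolicy REVOKE_ON_SUSPENDED =
            state -> state == ConnectionState.SUSPENDED || state == ConnectionState.LOST;

    /** Proposed behaviour: only a LOST connection revokes leadership. */
    static final ErrorPolicy REVOKE_ON_LOST =
            state -> state == ConnectionState.LOST;

    /** Returns true if leadership survives every state change in the sequence. */
    static boolean keepsLeadership(ErrorPolicy policy, List<ConnectionState> events) {
        for (ConnectionState state : events) {
            if (policy.isErrorState(state)) {
                return false; // leadership revoked; jobs would be canceled and restarted
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A short connection blip that recovers before the session expires.
        List<ConnectionState> blip = Arrays.asList(
                ConnectionState.CONNECTED,
                ConnectionState.SUSPENDED,
                ConnectionState.RECONNECTED);

        System.out.println("revoke on SUSPENDED keeps leadership: "
                + keepsLeadership(REVOKE_ON_SUSPENDED, blip));
        System.out.println("revoke on LOST keeps leadership: "
                + keepsLeadership(REVOKE_ON_LOST, blip));
    }
}
```

In this model a brief SUSPENDED/RECONNECTED sequence costs the leadership under the first policy but not under the second, while a LOST connection revokes leadership under both. How this maps onto the real {{LeaderLatch}} or {{LeaderSelector}} recipes depends on the Curator version in use and is not shown here.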

This message was sent by Atlassian JIRA
