hadoop-common-issues mailing list archives

From "Karthik Kambatla (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable
Date Fri, 13 Nov 2015 22:34:11 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004828#comment-15004828 ]

Karthik Kambatla commented on HADOOP-10584:

Based on my recollection from a while ago and a brief look at the attached prelim patch,
there are a couple of issues here:
# When the RM loses its connection while executing an operation, the operation fails without
enough retries. The patch adds a retry loop to handle this.
# When the RM loses its connection to ZK, it doesn't give up being Active, so it continues
to serve the apps and nodes connected to it. The patch, in addition to rejoining the election, has
the client (ZKFC/RM) enter neutral mode. Today, the RM doesn't do anything on {{enterNeutralMode}},
but this can be improved going forward.
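
To make the two points above concrete, here is a minimal sketch of the idea. It is not the code from the attached patch. {{ConnectionLossException}}, {{Client}} and {{withRetries}} are hypothetical stand-ins made up for illustration, and the retry count and sleep interval are arbitrary. The sketch retries an operation on connection loss, and once the retries are exhausted it asks the client to enter neutral mode rather than letting the elector go down.

```java
import java.util.concurrent.Callable;

public class ElectorRetrySketch {

    /** Hypothetical stand-in for a ZooKeeper connection-loss error. */
    public static class ConnectionLossException extends Exception {
        public ConnectionLossException(String msg) {
            super(msg);
        }
    }

    /** Hypothetical stand-in for the elector's client (ZKFC or RM). */
    public interface Client {
        void enterNeutralMode();
    }

    /**
     * Runs op, retrying on connection loss up to maxRetries additional times.
     * If every attempt fails, the client is told to enter neutral mode and
     * null is returned, so the caller can rejoin the election later instead
     * of shutting down.
     */
    public static <T> T withRetries(Callable<T> op, int maxRetries,
                                    long sleepMillis, Client client) throws Exception {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return op.call();
            } catch (ConnectionLossException e) {
                System.err.println("Unable to talk to ZK (attempt "
                        + (attempt + 1) + "): " + e.getMessage());
                if (attempt < maxRetries) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        // Retries exhausted: we can no longer claim to be Active.
        client.enterNeutralMode();
        return null;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated operation: fails twice with connection loss, then succeeds.
        String result = withRetries(() -> {
            if (calls[0]++ < 2) {
                throw new ConnectionLossException("quorum unavailable");
            }
            return "lock acquired";
        }, 3, 10L, () -> System.err.println("entering neutral mode"));
        System.out.println(result);
    }
}
```

In a real elector the caller would go on to rejoin the election after neutral mode is entered. That part is left out here.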

I won't be able to work on this for the next month or so. If anyone has cycles, please feel
free to take it up. 

> ActiveStandbyElector goes down if ZK quorum become unavailable
> --------------------------------------------------------------
>                 Key: HADOOP-10584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10584
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>         Attachments: hadoop-10584-prelim.patch, rm.log
> ActiveStandbyElector retries operations a few times. If the ZK quorum itself is down,
> the elector goes down and the daemons have to be brought up again.
> Instead, it should log that it is unable to talk to ZK, call becomeStandby on
> its client, and continue attempting to connect to ZK.
