hadoop-common-issues mailing list archives

From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-10584) ActiveStandbyElector goes down if ZK quorum become unavailable
Date Mon, 02 Nov 2015 22:14:27 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated HADOOP-10584:
    Target Version/s: 2.7.3  (was: 2.7.2)

Too late for 2.7.2, moving this out.

Overall, I am inclined to think that ActiveStandbyElector is already doing the right thing
and that the higher layers are simply passing in insufficient retry configurations.

All the retries inside ActiveStandbyElector are for CONNECTIONLOSS and OPERATIONTIMEOUT events,
so it seems silly that both HDFS and YARN pass in a retry count of 3 (together with a session
timeout of 10 seconds in YARN / 5 seconds in HDFS).

If you agree, I think we should just bump up these defaults so that we keep retrying for
as long as is acceptable. Thoughts?
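
To put rough numbers on the defaults quoted above, here is a back-of-the-envelope sketch. It
assumes, purely for illustration, that each retry attempt can wait at most one session timeout
before it fails; the elector may in practice give up sooner than this. The class and method
names are hypothetical and are not part of Hadoop.

```java
// Hypothetical helper, NOT part of Hadoop. It estimates how long a fixed
// retry count can cover a ZooKeeper outage, ASSUMING each attempt waits at
// most one session timeout before failing with CONNECTIONLOSS or
// OPERATIONTIMEOUT. Real retries may fail faster than this.
final class RetryBudget {
    private RetryBudget() {}

    /** Upper bound, in milliseconds, on the outage the retries can absorb. */
    static long worstCaseMillis(int maxRetries, long sessionTimeoutMillis) {
        if (maxRetries < 0 || sessionTimeoutMillis < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        return Math.multiplyExact((long) maxRetries, sessionTimeoutMillis);
    }

    /** Retries needed to cover a target outage under the same assumption. */
    static int retriesNeeded(long targetOutageMillis, long sessionTimeoutMillis) {
        if (targetOutageMillis < 0 || sessionTimeoutMillis <= 0) {
            throw new IllegalArgumentException("invalid outage or timeout");
        }
        // Ceiling division: a partial timeout still costs a whole attempt.
        long retries = (targetOutageMillis + sessionTimeoutMillis - 1) / sessionTimeoutMillis;
        return Math.toIntExact(retries);
    }

    public static void main(String[] args) {
        System.out.println("YARN budget: " + worstCaseMillis(3, 10_000) + " ms");
        System.out.println("HDFS budget: " + worstCaseMillis(3, 5_000) + " ms");
    }
}
```

Under that assumption, a retry count of 3 covers at most about 30 seconds in YARN and 15
seconds in HDFS, which may well be shorter than a ZK quorum outage.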

> ActiveStandbyElector goes down if ZK quorum become unavailable
> --------------------------------------------------------------
>                 Key: HADOOP-10584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10584
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>         Attachments: hadoop-10584-prelim.patch, rm.log
> ActiveStandbyElector retries operations a few times. If the ZK quorum itself is down,
> it goes down and the daemons will have to be brought up again.
> Instead, it should log the fact that it is unable to talk to ZK, call becomeStandby on
> its client, and continue to attempt connecting to ZK.
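
The behavior requested in the issue description above can be sketched roughly as follows. This
is a minimal illustration, not the attached patch and not the actual ActiveStandbyElector
code: the Client interface, the tryConnect supplier, and the maxAttempts bound (added only so
the sketch terminates) are all hypothetical.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch, NOT the real ActiveStandbyElector: after the retry
// budget is used up, log, demote the client to standby once, and keep
// trying to reach ZooKeeper instead of going down.
final class ReconnectSketch {
    private ReconnectSketch() {}

    /** Hypothetical stand-in for the elector's client callback. */
    interface Client {
        void becomeStandby();
    }

    /**
     * Returns the attempt number on which the connection succeeded, or -1 if
     * maxAttempts was reached. A real daemon would loop indefinitely and
     * sleep or back off between attempts; both are omitted here.
     */
    static int connectWithFallback(BooleanSupplier tryConnect, Client client,
                                   int maxRetries, int maxAttempts) {
        boolean demoted = false;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryConnect.getAsBoolean()) {
                return attempt;
            }
            if (!demoted && attempt >= maxRetries) {
                System.err.println("Unable to talk to ZK after " + attempt
                        + " attempts; becoming standby and continuing to retry");
                client.becomeStandby();
                demoted = true; // demote only once per outage
            }
        }
        return -1;
    }
}
```

The point of the sketch is only the ordering: the client is demoted once the retries are
exhausted, but the connection attempts continue rather than terminating the daemon.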

This message was sent by Atlassian JIRA