qpid-dev mailing list archives

From "Alan Conway (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (QPID-5719) HA becomes unresponsive once any of the brokers are SIGSTOPed
Date Thu, 24 Apr 2014 17:57:20 GMT

     [ https://issues.apache.org/jira/browse/QPID-5719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Conway updated QPID-5719:

    Status: Reviewable  (was: In Progress)

> HA becomes unresponsive once any of the brokers are SIGSTOPed
> -------------------------------------------------------------
>                 Key: QPID-5719
>                 URL: https://issues.apache.org/jira/browse/QPID-5719
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.28
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>         Attachments: ha-heartbeat.diff
> See also: https://bugzilla.redhat.com/show_bug.cgi?id=1086638
> Description of problem:
> qpid HA becomes unresponsive once any of the brokers are SIGSTOPed.
> There are three different cases:
> a] stopped ALL brokers
> b] stopped the primary
> c] stopped a backup
> In all of the cases listed above, the following observations were made:
> a-c]    RHCS clustat looks fine and reports that everything is OK
> a-c]    qpid-ha (status --all) hangs
> a,b,c*] any other clients are indefinitely blocked
>         a-b] cases directly at the beginning
>         c] case at the end, client able to recover after a minute or so,
>            due to connection timeout
> In fact this defect also proves that qpid-ha can be out of sync when compared to clustat
> as tracked by BZ.
> The expectations are:
>  * a] quorum lost, HA down (same as kill -9 to all nodes)
>       no clients able to communicate
>  * b] promotion of a new primary; there has to be a mechanism to get rid of the stopped process
>       clients should be able to communicate after recovery
>  * c] unresponsive backup should get restarted
>       clients should be able to communicate after duration when backup is detected as
>  * Generally, better integration between the Qpid HA environment and RHCS is needed
>    (aka SIGSTOP detection)
>  * A heartbeat between the primary and the backups is probably needed
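
For illustration only, here is a minimal sketch of the heartbeat idea from the last point. It is not the attached ha-heartbeat.diff and does not use any Qpid API; the class name HeartbeatMonitor, its methods, and the peer names are all hypothetical. It assumes that a SIGSTOPed broker keeps its process and its TCP connections alive, so only the absence of application-level heartbeats within a timeout reveals that it has stopped responding.

```python
import time


class HeartbeatMonitor:
    """Hypothetical helper: records the last heartbeat per peer and
    reports peers that have been silent for longer than `timeout`."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout    # seconds of silence tolerated
        self.clock = clock        # injectable so the logic can be tested
        self.last_seen = {}       # peer name -> time of last heartbeat

    def beat(self, peer):
        """Record that a heartbeat from `peer` has just arrived."""
        self.last_seen[peer] = self.clock()

    def expired(self):
        """Return the sorted names of peers silent for more than `timeout`."""
        now = self.clock()
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    # Simulated clock, so the example does not need to sleep.
    now = [0.0]
    monitor = HeartbeatMonitor(timeout=5.0, clock=lambda: now[0])

    monitor.beat("primary")
    monitor.beat("backup")

    now[0] = 3.0
    monitor.beat("backup")    # the backup is still responding
    # The "primary" has been SIGSTOPed: it sends nothing more, but its
    # connection would still appear open to its peers.

    now[0] = 6.0
    print(monitor.expired())  # prints ['primary']
```

A caller would still have to decide what to do with an expired peer (for example, close the connection and let the cluster manager restart or fence it); that policy is outside the scope of this sketch.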

