qpid-dev mailing list archives

From "Alan Conway (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (QPID-5904) qpid HA cluster may end-up in joining state after HA primary is killed
Date Mon, 01 Dec 2014 20:34:12 GMT

     [ https://issues.apache.org/jira/browse/QPID-5904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Conway resolved QPID-5904.
       Resolution: Fixed
    Fix Version/s: 0.30

r1614895 | aconway | 2014-07-31 09:55:11 -0400 (Thu, 31 Jul 2014) | 40 lines

QPID-5942: qpid HA cluster may end-up in joining state after HA primary is killed

There are two issues here, both related to the fact that rgmanager sees qpidd
and qpidd-primary as two separate services.

1. The service start/stop scripts can be called concurrently. This can lead to
   running a qpidd process whose pid is not in the pidfile. rgmanager cannot
   detect or kill this qpidd and cannot start another qpidd because of the lock
   on the qpidd data directory.

2. rgmanager sees a primary failure as two failures: qpidd and qpidd-primary,
   and will then try to stop and start both services. The order of these actions
   is not defined and can lead to rgmanager killing a service it has just started.

This patch makes two major changes to the init scripts:

1. Uses flock to lock the sensitive stop/start part of the scripts to ensure
   they are not executed concurrently.

2. On "stop" the scripts check if a running qpidd is primary or not. "qpidd stop"
   is a no-op if the running broker is primary, "qpidd-primary stop" is a no-op
   if it is not. This ensures that a broker will be stopped by the same stream
   of service actions that started it.
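
The two changes above can be sketched roughly as follows. This is a minimal illustration, not the actual init script from the patch: LOCKFILE, broker_role, should_stop, with_lock and stop_service are hypothetical names, and the broker role is read from a variable instead of being queried from a running broker.

```shell
# Sketch only: all names below are hypothetical, not taken from the patch.
LOCKFILE="${LOCKFILE:-${TMPDIR:-/tmp}/qpidd-init-sketch.lock}"

# Hypothetical helper: report the HA role of the running broker.
# A real script would ask the broker; here the role comes from a
# variable so the decision logic can be exercised without one.
broker_role() {
    echo "${BROKER_ROLE:-backup}"
}

# Succeed if "stop" for service $1 should actually stop the broker.
# "qpidd stop" acts only on a non-primary broker,
# "qpidd-primary stop" acts only on a primary broker.
should_stop() {
    role="$(broker_role)"
    case "$1" in
        qpidd)         [ "$role" != "primary" ] ;;
        qpidd-primary) [ "$role" = "primary" ] ;;
        *)             return 2 ;;
    esac
}

# Run a command while holding an exclusive flock on LOCKFILE, so that
# two service scripts cannot run the sensitive section concurrently.
with_lock() {
    (
        flock -x 9 || exit 1
        "$@"
    ) 9>"$LOCKFILE"
}

# Stand-in for the sensitive stop section: only prints what it would do.
stop_service() {
    if should_stop "$1"; then
        echo "stopping broker for $1"
    else
        echo "no-op for $1"
    fi
}
```

A caller would wrap the stop action in the lock, for example 'with_lock stop_service qpidd-primary'. Holding the lock on a file descriptor opened for the subshell means the lock is released automatically when the subshell exits, even if the command fails.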

Minor changes in this patch:
- better logging of broker start-up and shut-down sequence.
- qpid-ha heartbeat uses half of the timeout option.
- add missing timeouts in qpid-ha.


This changes the behavior of 'clusvcadm -d <qpidd-service>' on the primary node.
Previously this would have stopped the qpidd service on that node, killed the
qpidd process and relocated the primary service. Now this will stop the qpidd
service (as far as rgmanager is concerned) but will not kill qpidd or relocate
the primary service. When the primary is relocated the qpidd service will not be
able to re-start on that node until it is re-enabled with 'clusvcadm -e'.


> qpid HA cluster may end-up in joining state after HA primary is killed
> ----------------------------------------------------------------------
>                 Key: QPID-5904
>                 URL: https://issues.apache.org/jira/browse/QPID-5904
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.28
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>             Fix For: 0.30
>  Frantisek Reznicek 2014-07-09 08:59:30 EDT
> Description of problem:
> qpid HA cluster may end-up in joining state after HA primary is killed.
> Test scenario.
> Take a 3-node qpid HA cluster with all three nodes operational.
> A sender is then started, sending to a queue (purely transactional, with durable messages and a durable queue address).
> During that process the primary broker is killed multiple times.
> After the Nth primary broker kill the cluster is no longer functional, as all qpid brokers end up in the joining state:
> [root@dhcp-lab-216 ~]# qpid-ha status --all
> joining
> joining
> joining
> [root@dhcp-x-216 ~]# clustat
> Cluster Status for dtests_ha @ Wed Jul  9 14:38:44 2014
> Member Status: Quorate
>  Member Name                                   ID   Status
>  ------ ----                                   ---- ------
>                                      1 Online, Local, rgmanager
>                                      2 Online, rgmanager
>                                      3 Online, rgmanager
>  Service Name                         Owner (Last)                         State    
>  ------- ----                         ----- ------                         -----    
>  service:qpidd_1                                     started  
>  service:qpidd_2                                     started  
>  service:qpidd_3                                     started  
>  service:qpidd_primary                (                       stopped  
> [root@dhcp-x-165 ~]# qpid-ha status --all
> joining
> joining
> joining
> [root@dhcp-x-218 ~]# qpid-ha status --all
> joining
> joining
> joining
> I believe the key to hitting the issue is to kill the newly promoted primary soon after it starts appearing in starting/started state in clustat.
> My current understanding is that in a 3-node cluster, failures applied to a single node at a time should be handled by HA. This is what the testing scenario does:
> A    B    C (nodes)
> pri  bck  bck
> kill 
> bck  pri  bck
>      kill
> bck  bck  pri
>           kill
> ...
> pri  bck  bck
> kill
> bck  bck  bck
> It looks to me that there is a short window during promotion of a new primary in which killing that newly promoted primary causes the promotion procedure to get stuck with all brokers joining.
> I haven't seen such behavior in the past; either we are now more sensitive to this case (after the -STOP case fixes) or turning durability on sharply raises the probability.
> Version-Release number of selected component (if applicable):
> # rpm -qa | grep qpid | sort
> perl-qpid-0.22-13.el6.i686
> perl-qpid-debuginfo-0.22-13.el6.i686
> python-qpid-0.22-15.el6.noarch
> python-qpid-proton-doc-0.5-9.el6.noarch
> python-qpid-qmf-0.22-33.el6.i686
> qpid-cpp-client-0.22-42.el6.i686
> qpid-cpp-client-devel-0.22-42.el6.i686
> qpid-cpp-client-devel-docs-0.22-42.el6.noarch
> qpid-cpp-client-rdma-0.22-42.el6.i686
> qpid-cpp-debuginfo-0.22-42.el6.i686
> qpid-cpp-server-0.22-42.el6.i686
> qpid-cpp-server-devel-0.22-42.el6.i686
> qpid-cpp-server-ha-0.22-42.el6.i686
> qpid-cpp-server-linearstore-0.22-42.el6.i686
> qpid-cpp-server-rdma-0.22-42.el6.i686
> qpid-cpp-server-xml-0.22-42.el6.i686
> qpid-java-client-0.22-6.el6.noarch
> qpid-java-common-0.22-6.el6.noarch
> qpid-java-example-0.22-6.el6.noarch
> qpid-jca-0.22-2.el6.noarch
> qpid-jca-xarecovery-0.22-2.el6.noarch
> qpid-jca-zip-0.22-2.el6.noarch
> qpid-proton-c-0.7-2.el6.i686
> qpid-proton-c-devel-0.7-2.el6.i686
> qpid-proton-c-devel-doc-0.5-9.el6.noarch
> qpid-proton-debuginfo-0.7-2.el6.i686
> qpid-qmf-0.22-33.el6.i686
> qpid-qmf-debuginfo-0.22-33.el6.i686
> qpid-qmf-devel-0.22-33.el6.i686
> qpid-snmpd-1.0.0-16.el6.i686
> qpid-snmpd-debuginfo-1.0.0-16.el6.i686
> qpid-tests-0.22-15.el6.noarch
> qpid-tools-0.22-13.el6.noarch
> ruby-qpid-qmf-0.22-33.el6.i686
> How reproducible:
> rarely, timing is the key
> Steps to Reproduce:
> 1. have configured 3 node cluster
> 2. start the whole cluster up
> 3. execute transactional sender to durable queue address with durable messages and reconnect
> 4. repeatedly kill the primary broker once it is promoted
> Actual results:
>   After a few kills the cluster ends up non-functional, with all brokers in joining. Qpid HA can be brought down by inserting single, isolated failures into brokers that are being newly promoted.
> Expected results:
>   Qpid HA should tolerate a single failure at a time.
> Additional info:
>   Details on failure insertion:
>     * kill -9 `pidof qpidd` is the failure action
>     * Let T1 be the time between failure insertion and the new primary being ready to serve
>     * failure insertion period T2 > T1, i.e. no cumulative failures are inserted while HA is going through new primary promotion
>       -> this fact (in my view) proves that there is real issue
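
The symptom and the timing condition from the report above can be checked with a small sketch like the one below. The helper names all_joining and isolated_failures are hypothetical; the sketch assumes one broker status per line, as in the 'qpid-ha status --all' output quoted in the report, and T1 and T2 given as whole seconds.

```shell
# Sketch only: all_joining and isolated_failures are hypothetical helpers.

# Read one broker status per line on stdin and succeed only if at least
# one line was read and every line is exactly "joining".
all_joining() {
    count=0
    while IFS= read -r status; do
        [ "$status" = "joining" ] || return 1
        count=$((count + 1))
    done
    [ "$count" -gt 0 ]
}

# Succeed if the failure insertion period T2 ($2) is strictly greater
# than the promotion time T1 ($1), i.e. failures do not overlap.
isolated_failures() {
    [ "$2" -gt "$1" ]
}
```

On a cluster node this could be used as 'qpid-ha status --all | all_joining && echo "cluster stuck in joining"', run between kills in a reproduction loop.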

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: dev-unsubscribe@qpid.apache.org
For additional commands, e-mail: dev-help@qpid.apache.org
