qpid-dev mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (QPID-5973) HA cluster state may get stuck in recovering
Date Fri, 08 Aug 2014 09:25:15 GMT

    [ https://issues.apache.org/jira/browse/QPID-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090520#comment-14090520 ]

ASF subversion and git services commented on QPID-5973:

Commit 1616702 from [~aconway] in branch 'qpid/trunk'
[ https://svn.apache.org/r1616702 ]

QPID-5973: HA cluster state may get stuck in recovering

A backup queue is considered "ready" when all messages up to the first guarded
position have either been replicated and acknowledged or dequeued.

Previously this was implemented by waiting for the replicating subscription to
advance to the first guarded position and waiting for all expected acks. However,
if messages are dequeued out-of-order (which happens with transactions) there
can be a gap at the tail of the queue. The replicating subscription will not
advance past this gap because it only advances when there are messages to
consume. This resulted in backups stuck in catch-up. The recovering primary has
a time-out for backups that never re-connect, but if they connect successfully
and don't disconnect, the primary assumes they will become ready and waits -
causing the primary to be stuck in "recovering".

The fix is to notify a replicating subscription if it becomes "stopped"
because there are no more messages available on the queue. This implies that
either it is at the tail OR there are no more messages until the tail. Either way
we should consider this "ready" from the point of view of HA catch-up.
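
The readiness rule described above can be illustrated with a small toy model.
This is only a sketch in Python, not the broker's C++ code: the class
BackupQueueModel and all of its fields and methods are hypothetical names
invented for illustration, and positions are simplified to plain integers.
It compares the old check (subscription position has reached the first
guarded position, no outstanding acks) with the new one (the old check, or the
subscription has stopped because nothing is left to consume).

```python
from dataclasses import dataclass, field


@dataclass
class BackupQueueModel:
    """Toy model of one replicated queue during catch-up (hypothetical names)."""
    first_guarded: int                             # first guarded position
    messages: set = field(default_factory=set)     # positions still on the queue
    position: int = 0                              # last position consumed
    unacked: set = field(default_factory=set)      # delivered, not yet acknowledged
    stopped: bool = False                          # no more messages to consume

    def dequeue(self, pos):
        # A dequeued message no longer needs to be replicated or acknowledged.
        self.messages.discard(pos)
        self.unacked.discard(pos)

    def deliver_next(self):
        # The subscription only advances when there is a message to consume.
        later = sorted(p for p in self.messages if p > self.position)
        if not later:
            self.stopped = True    # the fix: this event is now reported
            return None
        self.stopped = False
        self.position = later[0]
        self.unacked.add(self.position)
        return self.position

    def ack(self, pos):
        self.unacked.discard(pos)

    def ready_old(self):
        # Old rule: position must reach the first guarded position.
        return self.position >= self.first_guarded and not self.unacked

    def ready_new(self):
        # New rule: being stopped also counts, since nothing remains before the tail.
        reached = self.position >= self.first_guarded or self.stopped
        return reached and not self.unacked


# Messages 4 and 5 are dequeued out of order, leaving a gap at the tail.
q = BackupQueueModel(first_guarded=5, messages={1, 2, 3, 4, 5})
q.dequeue(4)
q.dequeue(5)
while (pos := q.deliver_next()) is not None:
    q.ack(pos)

print(q.position, q.ready_old(), q.ready_new())  # 3 False True
```

In this model the subscription stalls at position 3, so the old rule never
reports ready, while the new rule does as soon as the subscription stops.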

> HA cluster state may get stuck in recovering
> --------------------------------------------
>                 Key: QPID-5973
>                 URL: https://issues.apache.org/jira/browse/QPID-5973
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.28
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>         Attachments: ha
> HA brokers can become stuck in "recovering" or "catchup" state when running transactional clients with multiple failovers.
> To reproduce, in one window run:
>     while qpid-txtest2 -b --total-messages 1000 --connection-options '{reconnect:true}' --tx-count 1000; do true; done
> In another window run (ha script attached):
>     while ha wait -a; do sleep .5; ha kill;  done
> After some time one or more brokers will become stuck in catchup or recovering state.
