qpid-dev mailing list archives

From "Alan Conway (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (QPID-4201) Destination cluster de-sync when federation link used for a longer time
Date Thu, 17 Jan 2013 16:34:12 GMT

     [ https://issues.apache.org/jira/browse/QPID-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Conway resolved QPID-4201.
-------------------------------

       Resolution: Won't Fix
    Fix Version/s:     (was: 0.19)
                   0.20

This issue affects the old cluster which is no longer part of Qpid for the 0.20 release.
                
> Destination cluster de-sync when federation link used for a longer time
> -----------------------------------------------------------------------
>
>                 Key: QPID-4201
>                 URL: https://issues.apache.org/jira/browse/QPID-4201
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.18
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>             Fix For: 0.20
>
>
> (see also  https://bugzilla.redhat.com/show_bug.cgi?id=836141)
> Description of problem:
> Using queue state replication from a broker (possibly clustered; this does not matter)
> to a cluster of brokers causes the destination cluster to de-sync after a long time:
> 2012-06-28 08:28:30 critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: @QPID.77153a41-7531-47f6-bf55-b30ffed69922: confirmed < (4799+0) but only sent < (4797+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)
> Version-Release number of selected component (if applicable):
> every version checked
> How reproducible:
> Depends on running time, but 10% for the default scenario
> Steps to Reproduce:
> (Ideally, if possible, rebuild qpid after changing cpp/src/qpid/SessionState.cpp to
> static const uint32_t SPONTANEOUS_REQUEST_INTERVAL = 64, which speeds up the reproducer
> very significantly.)
> 1) Have a source broker (or cluster; this does not matter) and a dest.cluster, with queue
> state replication of just one queue from the source to the dest.cluster.
> 2) On the federation route, set --ack to some low number (to speed up replication; I
> used --ack 5).
> 3) Randomly produce and consume messages on the queue to be replicated on the src.broker.
> Ideally, alternate enqueues and dequeues as much as possible. I don't know why, but
> more alternation speeds up the reproducer as well.
> 4) Now, be patient. After SPONTANEOUS_REQUEST_INTERVAL (by default 64k) synchronization
> messages have been sent _from_ the backup cluster (which requires around 100 times more
> messages to be enqueued and dequeued on the replicated queue), there is a probability of
> hitting the bug. Once it was hit on the first attempt (after 2^16 = 64k such synchronization
> messages), and once after 720896 messages (in the 11th "round" / "trial").
>   
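The two counts quoted in step 4 both land on whole multiples of the default interval. A minimal sketch of that arithmetic, assuming the default SPONTANEOUS_REQUEST_INTERVAL is exactly 2^16 as the report states; the helper name round_of is hypothetical and is not part of qpid:

```python
# Default interval as stated in the report ("by default 64k").
# This is an assumption taken from the text, not read from the qpid source.
SPONTANEOUS_REQUEST_INTERVAL = 2 ** 16


def round_of(message_count, interval=SPONTANEOUS_REQUEST_INTERVAL):
    """Return (completed_rounds, remainder) for a synchronization-message count."""
    return divmod(message_count, interval)


if __name__ == "__main__":
    # The two hits mentioned in the report.
    for count in (2 ** 16, 720896):
        rounds, remainder = round_of(count)
        print(f"{count} messages = {rounds} round(s), remainder {remainder}")
```

Both counts divide evenly by 65536, which matches the reporter's description of the second hit as occurring in the 11th round.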
> Actual results:
> All brokers in dst.cluster, except the one that has the fed.link established, shut
> down with log:
> 2012-06-27 15:39:46 critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: @QPID.314e73e8-8bc3-4f5a-b77d-6bdd4ee17e39: confirmed < (720895+0) but only sent < (720893+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)
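In both error lines quoted in this report, the confirmed point is two commands ahead of the sent point. A small sketch for pulling those numbers out of such a message, for anyone comparing logs across brokers; the regular expression and the helper name confirmed_sent_gap are assumptions made for illustration, based only on the two messages shown here:

```python
import re

# Pattern written against the two error messages quoted in this report;
# it is an assumption that other occurrences use the same layout.
PATTERN = re.compile(
    r"confirmed < \((\d+)\+\d+\) but only sent < \((\d+)\+\d+\)"
)


def confirmed_sent_gap(log_line):
    """Return (confirmed, sent, gap) from an error line, or None if it does not match."""
    match = PATTERN.search(log_line)
    if match is None:
        return None
    confirmed, sent = int(match.group(1)), int(match.group(2))
    return confirmed, sent, confirmed - sent


if __name__ == "__main__":
    samples = [
        "invalid-argument: confirmed < (4799+0) but only sent < (4797+0)",
        "invalid-argument: confirmed < (720895+0) but only sent < (720893+0)",
    ]
    for line in samples:
        print(confirmed_sent_gap(line))
```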
> Expected results:
> No such cluster de-sync
> Additional info:
> - Interesting fact: I was able to reproduce it only when using queue state replication.
> Although the bug is in the federation link session, using the fed.link without queue state
> replication did not lead to the bug.
> - The difference comes from the _beginning_ of the session communication. Per some traces,
> these AMQP messages sent from dst.cluster to the source are _not_ replayed by (not even
> multicast to) the "other dst.brokers" (those that have the session / connection as shadow,
> not local). So these messages are not replayed:
> 2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent cmd 0: {MessageSubscribeBody: queue=replication-queue; destination=replication-exchange; accept-mode=0; acquire-mode=0; resume-id resume-ttl=0; arguments={qpid.sync_frequency:F4:int32(100)}; }
> 2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent cmd 1: {MessageFlowBody: destination=replication-exchange; unit=0; value=4294967295; }
> 2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent cmd 2: {MessageFlowBody: destination=replication-exchange; unit=1; value=4294967295; }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@qpid.apache.org
For additional commands, e-mail: dev-help@qpid.apache.org

