qpid-dev mailing list archives

From "Mark Moseley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (QPID-2992) Cluster failing to resurrect durable static route depending on order of shutdown
Date Tue, 11 Jan 2011 00:58:48 GMT

    [ https://issues.apache.org/jira/browse/QPID-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979883#action_12979883 ]

Mark Moseley commented on QPID-2992:

On one of the nodes in question, I tried reproducing with this script and it seemed to work
perfectly. I added authentication as well, and it continued to work OK. Your test script is
pretty much exactly what I'm doing too.

I wonder though (and I'm just trying to think of reasons why it'd act differently in the two
scenarios): can you try this out on 4 separate nodes, even if virtualized? When I reproduce
this on the physical nodes with debug logging turned on, the log doesn't mention the node on the
other side of the federated link, whereas when it does work, I see this in the logs:

2011-01-10 19:35:12 debug Known hosts for peer of inter-broker link: amqp:tcp:

Running through this again today, I noticed that sometimes, with a completely fresh cluster,
the connection in a B2->B1->B1->B2 shutdown/startup does work. But then I do it again
and it doesn't. Or if I do the opposite order it breaks as well.

I just modified your script so that after the first round of stop/start/check-binding, it
flips the order and shuts them down again and starts them up -- and yes, I realize this is
the opposite order from my ticket :) -- and re-checks bindings and they're gone. I'm attaching
the output of your script.
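
The re-check after each stop/start round comes down to whether the exchange still lists a bind line in the "qpid-config exchanges --bindings" output. A minimal sketch of such a check follows, assuming the output format quoted in the ticket below. The function name has_binding is hypothetical and is not taken from cluster-fed.sh.

```shell
#!/bin/sh
# has_binding EXCHANGE
# Reads "qpid-config exchanges --bindings" output on stdin and succeeds
# (exit 0) if EXCHANGE is followed by at least one "bind" line.
# The function name and the parsing approach are illustrative assumptions.
has_binding() {
    awk -v ex="$1" '
        # An "Exchange <name> (type)" header starts a new section.
        $1 == "Exchange" { in_ex = ($2 == ("\047" ex "\047")); next }
        # Inside the wanted section, any "bind ..." line counts.
        in_ex && $1 == "bind" { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Example (run on a broker host; add -a options as needed):
#   qpid-config exchanges --bindings | has_binding bosmyex1 \
#       || echo "bosmyex1 has no bindings"
```

The exchange name is compared as a plain string rather than a regex, so names containing dots or other special characters are matched literally.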

(Just for clarification,,,,
and I've been trying to regex the hostnames so you guys didn't have
to deal with following my hostnames, but if you guys prefer, I don't mind just using the real

> Cluster failing to resurrect durable static route depending on order of shutdown
> --------------------------------------------------------------------------------
>                 Key: QPID-2992
>                 URL: https://issues.apache.org/jira/browse/QPID-2992
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Broker, C++ Clustering
>    Affects Versions: 0.8
>         Environment: Debian Linux Squeeze, 32-bit, kernel, Dell Poweredge 1950s.
Corosync==1.3.0, Openais==1.1.4
>            Reporter: Mark Moseley
>            Assignee: Alan Conway
>         Attachments: cluster-fed.sh
> I've got a 2-node qpid test cluster at each of 2 datacenters, which are federated together
with a single durable static route between each. Qpid is version 0.8. Corosync and openais
are stock Squeeze (1.2.1-3 and 1.1.2-2, respectively). OS is Squeeze, 32-bit, on Dell Poweredge
1950s, kernel 2.6.36. The static route is durable and is set up over SSL (but I can replicate
as well with non-SSL). I've tried to normalize the hostnames below to make things clearer;
hopefully I didn't mess anything up.
> Given two clusters, cluster A (consisting of hosts A1 and A2) and cluster B (with B1
and B2), I've got a static exchange route from A1 to B1, as well as another from B1 to A1.
Federation is working correctly, so I can send a message on A2 and have it successfully retrieved
on B2. The exchange local to cluster A is walmyex1; the local exchange for B is bosmyex1.
> If I shut down the cluster in this order: B2, then B1, and start back up with B1, B2,
the static route fails to get recreated. That is, on A1/A2, looking at the bindings,
exchange 'bosmyex1' does not get re-bound to cluster B; the only output for it in "qpid-config
exchanges --bindings" is just:
> <snip>
> Exchange 'bosmyex1' (direct)
> </snip>
> If however I shut the cluster down in this order: B1, then B2, and start B2, then B1,
the static route gets re-bound. The output then is:
> <snip>
> Exchange 'bosmyex1' (direct)
>     bind [unix.boston.cust] => bridge_queue_1_8870523d-2286-408e-b5b5-50d53db2fa61
> </snip>
> and I can message over the federated link with no further modification. Prior to a few
minutes ago, I was seeing this with the Squeeze stock openais==1.1.2 and corosync==1.2.1.
In debugging this, I've upgraded both to the latest versions with no change.
> I can replicate this every time I try. These are just test clusters, so I don't have
any other activity going on on them, or any other exchanges/queues. My steps:
> On all boxes in cluster A and B:
> * Kill the qpidd if it's running and delete all existing store files, i.e. contents of
> On host A1 in cluster A (I'm leaving out the -a user/test@host stuff):
> * Start up qpid
> * qpid-config add exchange direct bosmyex1 --durable
> * qpid-config add exchange direct walmyex1 --durable
> * qpid-config add queue walmyq1 --durable
> * qpid-config bind walmyex1 walmyq1 unix.waltham.cust
> On host B1 in cluster B:
> * qpid-config add exchange direct bosmyex1 --durable
> * qpid-config add exchange direct walmyex1 --durable
> * qpid-config add queue bosmyq1 --durable
> * qpid-config bind bosmyex1 bosmyq1 unix.boston.cust
> On cluster A:
> * Start other member of cluster, A2
> * qpid-route route add amqps://user/pass@HOSTA1:5671 amqps://user/pass@HOSTB1:5671 walmyex1
unix.waltham.cust -d
> On cluster B:
> * Start other member of cluster, B2
> * qpid-route route add amqps://user/pass@HOSTB1:5671 amqps://user/pass@HOSTA1:5671 bosmyex1
unix.boston.cust -d
> On either cluster:
> * Check "qpid-config exchanges --bindings" to make sure bindings are correct for remote
> * To see correct behaviour, stop cluster in the order B1->B2, or A1->A2, start
cluster back up, check bindings.
> * To see broken behaviour, stop cluster in the order B2->B1, or A2->A1, start cluster
back up, check bindings.
> This is a test cluster, so I'm free to do anything with it, debugging-wise, that would
be useful. 
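
As a rough sketch, the quoted setup steps could be collected into a small dry-run script. Every qpid-config and qpid-route command line is copied from the ticket as written. The helper names (run, setup_a1, setup_b1, add_routes) and the DRY_RUN variable are assumptions made for illustration and are not part of cluster-fed.sh. How qpidd is started and stopped is not shown in the ticket, so that is left out.

```shell
#!/bin/sh
# Dry-run sketch of the quoted setup steps. By default the commands are
# only printed; set DRY_RUN=0 to execute them on the appropriate host.
# Helper names and DRY_RUN are illustrative assumptions.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print the command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# On host A1, after qpidd has been started (broker -a options left out,
# as in the ticket).
setup_a1() {
    run qpid-config add exchange direct bosmyex1 --durable
    run qpid-config add exchange direct walmyex1 --durable
    run qpid-config add queue walmyq1 --durable
    run qpid-config bind walmyex1 walmyq1 unix.waltham.cust
}

# On host B1.
setup_b1() {
    run qpid-config add exchange direct bosmyex1 --durable
    run qpid-config add exchange direct walmyex1 --durable
    run qpid-config add queue bosmyq1 --durable
    run qpid-config bind bosmyex1 bosmyq1 unix.boston.cust
}

# Durable static routes, one per direction. In the ticket the first is
# added on cluster A after A2 starts, the second on cluster B after B2
# starts. HOSTA1, HOSTB1 and user/pass are the ticket's own placeholders.
add_routes() {
    run qpid-route route add amqps://user/pass@HOSTA1:5671 amqps://user/pass@HOSTB1:5671 walmyex1 unix.waltham.cust -d
    run qpid-route route add amqps://user/pass@HOSTB1:5671 amqps://user/pass@HOSTA1:5671 bosmyex1 unix.boston.cust -d
}
```

Sourcing the file and calling setup_a1 with the default DRY_RUN=1 prints the four commands for host A1 without touching a broker, which makes it easy to compare against the steps in the ticket before running anything for real.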

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

