qpid-dev mailing list archives

From "Mark Moseley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (QPID-2982) Inconsistent management messages in a cluster
Date Sat, 08 Jan 2011 00:57:46 GMT

    [ https://issues.apache.org/jira/browse/QPID-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979045#action_12979045 ]

Mark Moseley commented on QPID-2982:
------------------------------------

Commenting per Alan's email. Same setup as JIRA 2992. I was getting this with Squeeze corosync
(1.2.1) and openais (1.1.2); I upgraded them today (to 1.3.0 and 1.1.4, respectively) and
still get this error.
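
For reference, the installed versions can be confirmed with something like the following (a quick sketch; assumes the stock Debian/Squeeze packages):

    # Confirm the corosync/openais versions in play (Debian packaging assumed)
    dpkg -l corosync openais | grep '^ii'
    corosync -v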

As in 2992, there are two clusters, A (with nodes A1 and A2) and B (with B1 and B2).

I'm doing the same setup as in 2992:


On all boxes in cluster A and B:
* Kill qpidd if it's running and delete all existing store files, i.e. the contents of /var/lib/qpid/ (sketched below)
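
In shell terms, that cleanup step is roughly (assuming the default store-dir of /var/lib/qpid):

    # Stop any running broker and wipe its store
    pkill qpidd || true
    rm -rf /var/lib/qpid/*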

On host A1 in cluster A (I'm leaving out the -a user/test@host stuff; the full form is sketched after this list):
* Start up qpid
* qpid-config add exchange direct bosmyex1 --durable
* qpid-config add exchange direct walmyex1 --durable
* qpid-config add queue walmyq1 --durable
* qpid-config bind walmyex1 walmyq1 unix.waltham.cust
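
With the -a option written out (user/pass and HOSTA1 are placeholders), the A1 sequence looks like this; the B1 commands below take the same form:

    qpidd --daemon      # start the broker; actual options per your qpidd.conf
    qpid-config -a user/pass@HOSTA1 add exchange direct bosmyex1 --durable
    qpid-config -a user/pass@HOSTA1 add exchange direct walmyex1 --durable
    qpid-config -a user/pass@HOSTA1 add queue walmyq1 --durable
    qpid-config -a user/pass@HOSTA1 bind walmyex1 walmyq1 unix.waltham.cust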

On host B1 in cluster B:
* qpid-config add exchange direct bosmyex1 --durable
* qpid-config add exchange direct walmyex1 --durable
* qpid-config add queue bosmyq1 --durable
* qpid-config bind bosmyex1 bosmyq1 unix.boston.cust

On cluster A:
* Start other member of cluster, A2
* qpid-route route add amqps://user/pass@HOSTA1:5671 amqps://user/pass@HOSTB1:5671 walmyex1 unix.waltham.cust -d

On cluster B:
* Start other member of cluster, B2
* qpid-route route add amqps://user/pass@HOSTB1:5671 amqps://user/pass@HOSTA1:5671 bosmyex1 unix.boston.cust -d
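
For clarity on the argument order: qpid-route takes the destination broker first and the source broker second, so each command above creates a route on the local cluster's node 1 that pulls matching messages from the other cluster's exchange, and -d makes the route durable. The resulting links and routes can be checked afterwards with qpid-route's listing subcommands, e.g.:

    # Same placeholder credentials/hosts as above
    qpid-route link list  amqps://user/pass@HOSTA1:5671   # should show a link to HOSTB1
    qpid-route route list amqps://user/pass@HOSTA1:5671   # should list the walmyex1 route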

On either cluster:
* Check "qpid-config exchanges --bindings" to make sure bindings are correct for remote exchanges
* To see correct behaviour, stop the cluster in the order B1->B2 (or A1->A2), start the cluster back up, and check the bindings.
* To see broken behaviour, stop the cluster in the order B2->B1 (or A2->A1), start the cluster back up, and check the bindings (a scripted version of the broken order follows this list).
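
Scripted, the broken stop order on cluster B looks roughly like this (the ssh/pkill mechanics are just my shorthand; substitute however you normally stop and start qpidd):

    ssh HOSTB2 'pkill qpidd'       # stop the second member first
    ssh HOSTB1 'pkill qpidd'       # then the original member
    ssh HOSTB1 'qpidd --daemon'    # bring the cluster back up
    ssh HOSTB2 'qpidd --daemon'
    qpid-config -a user/pass@HOSTB1 exchanges --bindings   # compare against the expected bindings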


If, on A1, I run qpid-tool and do something like 'list' enough times that it returns data (instead of the first few times, where it returns nothing), then on node A2 I very often get the following. This is the log on A2 all the way from startup, which is only 10-15 secs earlier:

2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/watchdog.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/acl.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/xml.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/replicating_listener.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/replication_exchange.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/ssl.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/msgstore.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/cluster.so
2011-01-07 19:35:30 info Management enabled
2011-01-07 19:35:30 info Registered xml exchange
2011-01-07 19:35:30 info Registered replication exchange
2011-01-07 19:35:31 notice Journal "TplStore": Created
2011-01-07 19:35:31 notice Store module initialized; store-dir=/var/lib/qpid
2011-01-07 19:35:31 info > Default files per journal: 8
2011-01-07 19:35:31 info > Default journal file size: 24 (wpgs)
2011-01-07 19:35:31 info > Default write cache page size: 32 (Kib)
2011-01-07 19:35:31 info > Default number of write cache pages: 32
2011-01-07 19:35:31 info > TPL files per journal: 8
2011-01-07 19:35:31 info > TPL journal file size: 24 (wpgs)
2011-01-07 19:35:31 info > TPL write cache page size: 4 (Kib)
2011-01-07 19:35:31 info > TPL number of write cache pages: 64
2011-01-07 19:35:31 notice Initializing CPG
2011-01-07 19:35:31 notice Cluster store state: dirty cluster-id=54ef8a97-b8e8-47a2-935d-e619e4d7c895
2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 PRE_INIT) configuration change: 101.0.0.0:10552 102.0.0.0:13904
2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 PRE_INIT) Members joined: 102.0.0.0:13904
2011-01-07 19:35:31 notice Cluster recovery: recovered journal data discarded and journal files pushed down
2011-01-07 19:35:31 notice Store directory /var/lib/qpid/rhm was pushed down (saved) into directory /var/lib/qpid/_cluster.bak.0002.
2011-01-07 19:35:31 notice Journal "TplStore": Created
2011-01-07 19:35:31 info SASL enabled
2011-01-07 19:35:31 notice Listening on TCP port 5672
2011-01-07 19:35:31 notice Read ACL file "/etc/qpid/acl.conf"
2011-01-07 19:35:31 notice Listening for SSL connections on TCP port 5671
2011-01-07 19:35:31 info cluster(102.0.0.0:13904 INIT) not active for links.
2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 JOINER) joining cluster walclust
2011-01-07 19:35:31 notice Broker running
2011-01-07 19:35:31 info cluster(102.0.0.0:13904 JOINER) stall for update offer from 101.0.0.0:10552 to 102.0.0.0:13904
2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 UPDATEE) receiving update from 101.0.0.0:10552
2011-01-07 19:35:31 notice Journal "walmyq1": Created
2011-01-07 19:35:31 info Queue "walmyq1": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "qpid.cluster-update": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info cluster(102.0.0.0:13904 UPDATEE) new shadow connection 101.0.0.0:10552-6
2011-01-07 19:35:31 info Queue "bridge_queue_1_d0638d51-8b76-48d0-a232-10e8bbed772d": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info cluster(102.0.0.0:13904 UPDATEE) new shadow connection 101.0.0.0:10552-7
2011-01-07 19:35:31 info Queue "reply-HOSTA1.10654.1": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "topic-HOSTA1.10654.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "qmfc-v2-HOSTA1.10654.1": Policy created: type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "qmfc-v2-ui-HOSTA1.10654.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "qmfc-v2-hb-HOSTA1.10654.1": Policy created: type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:32 info cluster(102.0.0.0:13904 UPDATEE) new shadow connection 101.0.0.0:10552-9
2011-01-07 19:35:32 info cluster(102.0.0.0:13904 CATCHUP) member update: 101.0.0.0:10552(member) 102.0.0.0:13904(unknown) frameSeq=2427
2011-01-07 19:35:32 notice cluster(102.0.0.0:13904 CATCHUP) update complete, starting catch-up.
2011-01-07 19:35:32 info cluster(102.0.0.0:13904 CATCHUP) member update: 101.0.0.0:10552(member) 102.0.0.0:13904(member) frameSeq=2427
2011-01-07 19:35:32 notice cluster(102.0.0.0:13904 READY) caught up.
2011-01-07 19:35:32 info Passivating links
2011-01-07 19:35:39 info Queue "reply-HOSTA1.10670.1": Policy created: type=reject; maxCount=0;
maxSize=104857600
2011-01-07 19:35:40 info Queue "topic-HOSTA1.10670.1": Policy created: type=ring; maxCount=0;
maxSize=104857600
2011-01-07 19:35:40 info Queue "qmfc-v2-HOSTA1.10670.1": Policy created: type=reject; maxCount=0;
maxSize=104857600
2011-01-07 19:35:40 info Queue "qmfc-v2-ui-HOSTA1.10670.1": Policy created: type=ring; maxCount=0;
maxSize=104857600
2011-01-07 19:35:40 info Queue "qmfc-v2-hb-HOSTA1.10670.1": Policy created: type=ring; maxCount=0;
maxSize=104857600
2011-01-07 19:35:51 error Execution exception: invalid-argument: mark@QPID.HOSTA1.10670.1:
confirmed < (109+0) but only sent < (108+0) (qpid/SessionState.cpp:154)
2011-01-07 19:35:51 critical cluster(102.0.0.0:13904 READY/error) local error 2770 did not
occur on member 101.0.0.0:10552: invalid-argument: mark@QPID.HOSTA1.10670.1: confirmed <
(109+0) but only sent < (108+0) (qpid/SessionState.cpp:154)
2011-01-07 19:35:51 critical Error delivering frames: local error did not occur on all cluster
members : invalid-argument: mark@QPID.HOSTA1.10670.1: confirmed < (109+0) but only sent
< (108+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)
2011-01-07 19:35:51 notice cluster(102.0.0.0:13904 LEFT/error) leaving cluster walclust
2011-01-07 19:35:51 notice Shut down


In the logs above:
* 'mark' is the qpid SASL username I'm using. It's definitely valid on both nodes, as I can use it in qpid-config just fine on A1 and A2. I also use it in qpid-tool successfully.
* Those "Policy created" messages show up when I initially did 'list' in qpid-tool on A1.
* When 'list' returned data on A1, A2 shut down (the exact session is sketched below).
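
The qpid-tool session that triggers it is nothing fancier than this (same placeholder credentials as above; 'qpid:' is the tool's prompt):

    $ qpid-tool mark/pass@HOSTA1
    qpid: list        # the first few attempts return no data
    qpid: list        # once this returns data, A2 logs the invalid-argument error and shuts down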

I don't get the above error 100% of the time. Sometimes it'll work consistently, and other times it won't, and won't for a while (A2 will shut back down almost immediately after rejoining the cluster). The only way I can consistently get it to crash is to replicate the broken scenario in JIRA 2992: after failing B2->B1, A2 will crash as soon as qpid-tool starts returning data (but not before), even if I restart A1 and A2 repeatedly.

These are test clusters, so there's nothing else going on on them. I can also perform whatever testing would be useful to the Qpid team.

> Inconsistent management messages in a cluster
> ---------------------------------------------
>
>                 Key: QPID-2982
>                 URL: https://issues.apache.org/jira/browse/QPID-2982
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.8
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>             Fix For: 0.9
>
>
> Management messages on different members of a cluster can be different, which can cause a cluster member to fail with an invalid-arg error.
> cluster_tests.py has a test_management test which exercises management code and collects detailed trace logging. The logs are compared for consistency.
> Currently this test fails. All inconsistencies revealed by the test need to be corrected.


