qpid-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alan Conway" <acon...@redhat.com>
Subject Re: Review Request 19566: QPID-5642: Message sequence numbers (enabled with qpid.msg_sequence) should be persistent across broker restarts
Date Wed, 02 Apr 2014 15:17:11 GMT


> On April 1, 2014, 9:17 p.m., Alan Conway wrote:
> > Sorry, catching up. I think this patch is abusing the bootsequence which really
has nothing at all to do with exchange sequence numbers. It would work on a standalone (but
still be an abuse IMO) broker but it doesn't work for HA. The base for exchage sequence numbers
has to be synchronized across a HA cluster, whereas the number of times each broker has rebooted
*cannot* be synchronized without breaking its existing meaning as a reboot count. I'm also
a bit wary of the division of sequence number into 17 bits for "restart" count 47 bits for
"message" count. What are the grounds for that division? Is it safe to assume no overflows?
> > 
> > I see the goal of avoiding persisting the sequence number on every message, but
how about persisting it (per exchange) on some suitably large number of messages? E.g. persist
an initial next-sequence of 64k for each exchange, with the sequence number starting at 1.
Once the sequence number gets within 32k of next-sequence we increase next-sequence of by
64k etc. On reboot or on promotion of the primary we jump each exchange sequence number ahead
to the exchange's next-sequence number.
> > 
> > That lets us take smaller jumps on each restart/failover (16 bits vs. 47 bits) and
avoids using a single counter for every exchange which makes overflow less likely. The next-sequence
number for each exchange would need to be persisted, replicated on joining (both can be done
by putting it back in the args) and replicated via a management event when it is increased
on the primary. Jumping the next-sequence number ahead when we still have 32k messages left
to go means that we have time to store & replicate before we pass the new limit. The next-sequence
doesn't have to be precisely synchronized in a cluster provided it is always a good margin
bigger than the highest sequence number used by that exchange on the primary so far. During
promotion (before allowing backups to join) we would need to jump each exchange sequence ahead
to the next-sequence for the exchange and increase next-sequence by 64k to make sure the new
primary is using higher sequence numbers than the old.
> > 
> > Are we sure nobody assumes that exchange sequence numbers are sequential? The above
is fine for duplicate detection but are we sure nobody does missed message detection by checking
for gaps in the exchange sequence numbers? That would complicate everything quite a bit...
> > 
> > 
> >
> 
> Pavel Moravec wrote:
>     This seems to be sound to me. To sum it up:
>     - exchanges with sequencing to have new argument qpid.next_sequenceNo, by default
set to 64k for new exchanges, or to the value read from store during startup (or being updated
from primary broker)
>       - if the argument is missing in stored args, it will be automatically added there
(to support upgrades)
>     - when incrementing sequenceNo in preRoute, if args(qpid.next_sequenceNo) - sequenceNo
== 32k, then execute 3 steps:
>       - increment qpid.next_sequenceNo by 64k
>       - update store bdb for the exchange
>       - if clustered broker and (primary or being promoted), notify backups about the
increment of qpid.next_sequenceNo
>     - when joining a cluster, active broker will update the joiner one automatically
(via exchange arguments) - no need to change
>     - when promoting a node to active:
>       - set sequenceNo = args(qpid.next_sequenceNo)
>       - execute the same 3 steps from "when incrementing sequenceNo"
>     
>     Two questions:
>     1) Can a node being promoted already update backups? Or is there a way of postponing
the update? (see latest point above)
>     2) What if - after a new promotion - the broker sends few messages with sequenceNo
say 128k but dies before it really writes to the disk args(qpid.next_sequenceNo)=(128+64)k
? Can it even happen?
>     
>     
>     Sequence number will then jump ahead by at most 96k whenever a standalone broker
is restarted or new promotion in cluster happens (or cold start of cluster happens). That
means, there can be 93824992236885 such "jump events" without sequence number overflow (or
93824671436104 "jump events" for steadily message flow of 1M messages per second for 10 years).
More than fair enough.
>     
>     
>     Comment to "Are we sure nobody assumes that exchange sequence numbers are sequential?"
- Consumers get messages from queues and no queue can be guaranteed to get _all_ messages
routed by the exchange (until all know the exchange is fanout or binding key is empty or so
- but I think nobody can build on this premise). So jumps in exchange sequence numbers can
be always meant as "the skipped messages were not meant for this queue".
>
> 
> Alan Conway wrote:
>     Notifying backups of changes is done via QMF events, so a broker can always emit
a "nextSequenceNo changed" event and if there are backups listening they will pick it up.

>     
>     So to your questions:
>     
>     1) A node being promoted can emit QMF events but there is not a guaranteed ordering
that the Exchange response will be seen before any QMF events for that exchange. This is a
real race condition, I have made fixes for other special cases of this problem. Either we
can make another special case, or I have an idea for a more general fix for the problem. Let
me look at this, for your work assume that it will be OK to send a nextSequenceNo event during
update. I'll update this review when I have a patch for the race condition.
>     
>     2) The reason that I propose updating the nextSequenceNo when we get within 32k of
the current is to ensure we have time to write the disk & do the updates before the actual
sequence number is used. E.g. say nextSeqNo=128, actualSeqNo=96k. At this point we will update
nextSeqNo to 140k. I am assuming that we will not receive 32k new messages before we can write
this to disk, so even if we crash before the write, with nextSeqNo=128 still on disk, it will
still be higher than any actual sequence number we have sent. I am also assuming that the
nextSeqNo update QMF event will be processed by backups before we receive 32k new messages,
to ensure that if we fail the new primary will have a sufficient nextSeqNo. We could increase
the safety margin at the cost of decreasing our jumps-to-overflow.
>     
>     The alternative to a "safety margin" would be to effectively lock the Exchange while
we are updating the nextSeqNo. on disk and to the cluster. That might be ok for performance
since we don't do it often. However it would be hard to implement without deadlocks, especially
the cluster update part which would also require a new feedback mechanism in the cluster.
I prefer the safety margin approach as being simpler provided the margin is generous enough.
>     
>     
>
> 
> Pavel Moravec wrote:
>     Re 2) I meant scenario when a broker is being promoted. Then it sets actualSeqNo=nextSeqNo;
nextSeqNo+=64k; (and writes it to the disk and notifies backups). Until the broker writes
the update to the disk, there is in fact no safety margin available. But to really hit duplicate
sequence number being set to a message, _all_ brokers in the cluster would have to be killed
before writing the updated nextSequenceNo to the disk - low probable event we can live with,
in my opinion.

We can solve this by making sure a newly promoted primary does not go READY until all nextSeqNo
have been written to disk and all exchanges have been replicated with the updated nextSeqNo.
That's not the current behavior, we only wait for queues to be ready, but I'll fix that.


- Alan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19566/#review39200
-----------------------------------------------------------


On April 1, 2014, 10:45 a.m., Pavel Moravec wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19566/
> -----------------------------------------------------------
> 
> (Updated April 1, 2014, 10:45 a.m.)
> 
> 
> Review request for qpid, Alan Conway, Gordon Sim, and Kim van der Riet.
> 
> 
> Bugs: QPID-5642
>     https://issues.apache.org/jira/browse/QPID-5642
> 
> 
> Repository: qpid
> 
> 
> Description
> -------
> 
> Elegant (but not performance optimal) way of patch:
> 
> 1) In Exchange::PreRoute::PreRoute, update exchange in store (whole entry).
> 2) The update method is dealed very similarly like MessageStoreImpl::create(db_ptr db,
IdSequence& seq, const qpid::broker::Persistable& p) method, i.e. calls BufferValue
that calls Exchange::encode.
> 
> Here the code can be unified by merging MessageStoreImpl::update intoMessageStoreImpl::create
method where the code almost duplicates.
> 
> However I do not see the patch as performance efficient, as with every message preroute,
new qpid::framing::Buffer is filled in Exchange::encode method, data are copied from it to
char* BufferValue::data and even then they are really written to the BDB. While in fact we
just update the only one number in the Buffer.
> 
> I tried to come up with less performance invasive approach (for those interested, see
https://bugzilla.redhat.com/attachment.cgi?id=877576&action=diff - if you dont have access
there, let me write), that keeps qpid::framing::Buffer for every durable exchange with sequencing
enabled, but it returned (among others) into the need of changing the way store encodes/decodes
Exchange instance (change in Exchange::encode / decode methods). What would make the broker
backward incompatible.
> 
> Is the performance penalty (due to Exchange::encode method called for every message preroute)
acceptable?
> 
> Is it worth merging MessageStoreImpl::update intoMessageStoreImpl::create method?
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/broker/Exchange.cpp 1582719 
> 
> Diff: https://reviews.apache.org/r/19566/diff/
> 
> 
> Testing
> -------
> 
> - reproducer from JIRA verified
> - automated tests passed (except for those known to fail due to QPID-5641 (valgrind &
legacystore)
> 
> 
> Thanks,
> 
> Pavel Moravec
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message