qpid-dev mailing list archives

From Adam Markey <a...@etinternational.com>
Subject Bug in persistent message store startup recovery (C++ broker)
Date Fri, 02 Mar 2012 22:56:18 GMT
I found a bug in the C++ broker's journal recovery code in the
persistent message store library from qpidcomponents.org. I'm not sure
if this is the correct place to post this, but I couldn't find that
component in either JIRA or Red Hat's Bugzilla. If anyone knows where to
log this as a bug, please let me know where and what component it
belongs under.

Reproducing:
1) launch qpidd with 4 journal files of default size (24 pages of 64KB
each)
2) create a persistent queue and send messages until an "Enqueue
capacity threshold" error is generated. The messages must be sized such
that a) the first journal file contains more than a single message and
b) when the enqueue capacity is hit, all available space in the first 3
journal files has been used.
3) retrieve and accept the first message.
4) shut down the broker
5) when attempting to restart the broker, a JERR_JCNTL_RECOVERJFULL
error is generated and the broker exits.

Expected operation:
broker starts and recovers stored messages normally in step 5.

I'm attaching a reproducer which does steps 2 & 3 using messages sized
to use 1/2 of a journal file. With messages of that size, the bug is
only triggered by deleting a single message; deleting more than one or
no messages at all allows the broker to be restarted successfully.

Note that messages are always retrieved and deleted in the order in
which they were inserted. Once the enqueue capacity exactly matches the
threshold, the database can only be properly recovered if all messages
residing in the first journal file are deleted before the broker is
shut down.


I've investigated this some and the cause is that the enqueue capacity
is not enforced against delete operations (since they are actually
freeing space), but the distinction is not made when restarting.
Thus, delete operations in the journal that are beyond the enqueue
capacity trigger the journal full error and make the store
unrecoverable. This is because the journal is considered full if the
last file id is directly before the first file id and the first file
isn't empty.

I have come up with both a workaround and a proposed patch:

The workaround is to never use fewer than 7 journal files. With 7 journal
files, the 20% reserved space is equal to 1.4 journal files. This gives
us 40% of a journal file that can be filled with delete operations
before anything is stored in the last journal file. Because deletions
only take 1 DBLK, in the worst case (all messages only use 1 SBLK) we
can fully delete all messages from the first journal file using 25% of a
journal file. So enough deletes can always fit into the last 40% of the
second-to-last journal file. By the time we get to the last journal
file, the first file has been completely emptied.

The proposed patch is a one-liner that modifies the journal full
calculation as described in the following pseudo-code:

journal full = original full calculation && (something enqueued in last
file || last file is full)

Explanation:
Nothing should ever be enqueued in the last file since the enqueue
capacity check should prevent that, so this is basically a sanity check.
Otherwise, as long as there is still space in the last file, the journal
is not full: the last file only contains deletes or transactions and
there is still space for more deletes or transactions. In this case the
journal can be safely recovered, and once more messages are deleted the
enqueue capacity can drop below the threshold allowing enqueues to
resume.
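
For illustration, the pseudo-code could look roughly like this in C++.
This is only a sketch: the struct and its field names are hypothetical
stand-ins for whatever the recovery code actually tracks, and the
"original" check is written from the description earlier in this message
rather than copied from the store's source.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of the journal state seen during recovery.
// These names are illustrative and do not come from the store library.
struct RecoveredJournal {
    std::size_t numFiles;      // number of journal files
    std::size_t firstFileId;   // file holding the oldest live record
    std::size_t lastFileId;    // file holding the newest record
    bool firstFileEmpty;       // no live enqueues remain in the first file
    bool lastFileHasEnqueues;  // something was enqueued in the last file
    bool lastFileFull;         // no space left in the last file
};

// Original test as described: the last file is directly before the first
// file (with wrap-around) and the first file isn't empty.
bool originalFull(const RecoveredJournal& j) {
    assert(j.numFiles > 0);
    const bool lastDirectlyBeforeFirst =
        (j.lastFileId + 1) % j.numFiles == j.firstFileId;
    return lastDirectlyBeforeFirst && !j.firstFileEmpty;
}

// Proposed test: additionally require that the last file holds an enqueue
// or has no space left.
bool patchedFull(const RecoveredJournal& j) {
    return originalFull(j) && (j.lastFileHasEnqueues || j.lastFileFull);
}
```

For the failing case above (4 files, first file still holding a message,
last file holding only a delete and not yet full), originalFull() is true
while patchedFull() is false, so recovery would proceed.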

