directory-dev mailing list archives

From Emmanuel Lécharny <>
Subject Re: [Mavibot] BtreeHeader management, transactions and other things...
Date Mon, 14 Sep 2015 08:31:37 GMT
On 14/09/15 04:40, Kiran Ayyagari wrote:
> On Mon, Sep 14, 2015 at 5:44 AM, Emmanuel Lécharny <>
> wrote:
>> Hi,
>> I have looked at the code today, and I found that the way we handle the
>> BtreeHeader is a bit complex, and does not fit some other ideas I had
>> regarding the management of transactions.
>> Currently, we store a map of BHs where, for each BTree, we keep the
>> latest BH (i.e., the one associated with the most recent revision). When
>> we want to update a BTree, or read it, we first check this map and use
>> the returned BH to start updating or reading the BTree.
>> This is not good, IMO.
>> Actually, we should always fetch the most recent revision for a given
>> BTree from the BOB. That changes the implementation of the
>> getBtreeHeader() method.
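A minimal sketch of that lookup, assuming the BOB can be modeled as a sorted map from (B-tree name, revision) to header. `BtreeOfBtrees`, `BtreeHeader` and its fields are hypothetical names for illustration, not Mavibot's actual classes:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical stand-in for a B-tree header: the root page offset of one revision. */
record BtreeHeader(String btreeName, long revision, long rootPageOffset) {}

/**
 * Hypothetical model of the BOB (B-tree of B-trees): for each B-tree name,
 * a sorted map from revision to the header of that revision.
 */
final class BtreeOfBtrees {
    private final Map<String, NavigableMap<Long, BtreeHeader>> headers = new HashMap<>();

    void add(BtreeHeader header) {
        headers.computeIfAbsent(header.btreeName(), name -> new TreeMap<>())
               .put(header.revision(), header);
    }

    /** Most recent header for the B-tree, read from the BOB instead of a side map. */
    BtreeHeader getBtreeHeader(String name) {
        NavigableMap<Long, BtreeHeader> revisions = headers.get(name);
        return revisions == null ? null : revisions.lastEntry().getValue();
    }

    /** Header visible at a given revision: the newest one not newer than it. */
    BtreeHeader getBtreeHeader(String name, long revision) {
        NavigableMap<Long, BtreeHeader> revisions = headers.get(name);
        if (revisions == null) {
            return null;
        }
        Map.Entry<Long, BtreeHeader> entry = revisions.floorEntry(revision);
        return entry == null ? null : entry.getValue();
    }
}
```

With this shape there is no separate synchronized map of "latest" headers to keep in sync: the latest header is simply the last entry for that B-tree, and a reader pinned to an older revision uses `floorEntry`.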
>> Why should we do it differently, and how does it connect with the TXNs?
>> It's simple (well, sort of). Txns will hold in a working memory (WM) all
>> the pages that are updated from the beginning to the end of the
>> transaction, allowing us to avoid many updates on disk. Currently, the
>> way we process transactions is pretty brutal: we write the modified
>> pages to disk as we go, until the end of the txn, even if we might very
>> well modify one of those pages again later.
>> So the 'new way' should update the pages we have in the WM. That is
>> possible if we reference pages by their offset, but that changes the
>> way we process the pages (currently, we preemptively copy a page that
>> we are going to modify). We will *not* copy a page anymore if it's
>> present in the WM; we will just update it. At the end, the WM will
>> contain all the modified pages, and we will just have to write them to
>> disk (or discard them) when we commit (or roll back) the transaction.
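A minimal sketch of such a per-transaction working memory, assuming pages are identified by their offset and the disk is stood in for by a map. `Page`, `WorkingMemory` and `pageForUpdate` are hypothetical names. To keep it short, committed pages simply replace the stored ones at the same offset; an append-only, copy-on-write store would give them new offsets so that older revisions stay readable.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical page: an offset plus some mutable content. */
final class Page {
    final long offset;
    String content;

    Page(long offset, String content) {
        this.offset = offset;
        this.content = content;
    }

    Page copy() {
        return new Page(offset, content);
    }
}

/** Hypothetical working memory of one transaction, keyed by page offset. */
final class WorkingMemory {
    private final Map<Long, Page> disk;                // stand-in for the file (simplified)
    private final Map<Long, Page> pages = new HashMap<>();

    WorkingMemory(Map<Long, Page> disk) {
        this.disk = disk;
    }

    /** Copies a page only on its first modification; later updates hit the WM copy. */
    Page pageForUpdate(long offset) {
        return pages.computeIfAbsent(offset, o -> {
            Page stored = disk.get(o);
            if (stored == null) {
                throw new IllegalArgumentException("no page at offset " + o);
            }
            return stored.copy();
        });
    }

    /** Commit: each modified page is written once, whatever the number of updates. */
    int commit() {
        int written = pages.size();
        disk.putAll(pages);
        pages.clear();
        return written;
    }

    /** Rollback: the modified pages are discarded and the disk is never touched. */
    void rollback() {
        pages.clear();
    }
}
```

A page modified ten times during the transaction is still written only once, at commit time.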
>> But the current code has only two ways to fetch a page:
>> - either it's in the cache, and we return it
>> - or we read the page from disk
>> (This is what PersistedPageHolder.getValue() does.)
>> We need to add a third possibility: get the page from the WM when we
>> are updating the BTree, and if it's not present in the WM, fetch it
>> (from the cache or the disk) and put it into the WM.
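The three-way lookup could be sketched as below. This is not PersistedPageHolder's code: `PageFetcher`, `getPageForUpdate` and the `LongFunction` disk reader are hypothetical, pages are reduced to strings, and the choice to also cache a page read from disk is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

/** Hypothetical lookup used while updating a B-tree: WM, then cache, then disk. */
final class PageFetcher {
    private final Map<Long, StringBuilder> workingMemory = new HashMap<>();
    private final Map<Long, String> cache;          // stand-in for the page cache
    private final LongFunction<String> diskReader;  // stand-in for reading a page at an offset
    int diskReads = 0;                              // counter, only for the example

    PageFetcher(Map<Long, String> cache, LongFunction<String> diskReader) {
        this.cache = cache;
        this.diskReader = diskReader;
    }

    StringBuilder getPageForUpdate(long offset) {
        StringBuilder page = workingMemory.get(offset);
        if (page != null) {
            return page;                         // 1. already in the WM: update in place
        }
        String stored = cache.get(offset);       // 2. otherwise try the cache
        if (stored == null) {
            stored = diskReader.apply(offset);   // 3. otherwise read it from disk
            diskReads++;
            cache.put(offset, stored);
        }
        page = new StringBuilder(stored);        // mutable copy owned by this transaction
        workingMemory.put(offset, page);
        return page;
    }
}
```

Only the first access in a transaction pays for the copy; the cache keeps the committed version, so readers are not affected by in-flight updates.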
>> Then the update (insert or delete) must be done without creating a copy.
>> That is a huge change in the code... But this is necessary if we want to
>> have efficient transaction handling. It also allows us to get rid of
>> those synchronized Maps containing the BTreeHeaders.
>> One more thing (à la Apple): we most certainly don't need to manage
>> multiple values with sub-btrees in Mavibot. As soon as we have a fully
>> working transaction system, we can perfectly well expect the application
>> to deal with that specific case: after all, in a Btree<K, V>, where V is
>> the user's data structure, it's up to the user to make V a BTree and to
>> deal with it. As we will have a cross-B-tree transaction system, it
>> won't be expensive; plus, this is already what we do with JDBM, so the
>> ApacheDS code will not be difficult to port.
> we should move to explicit begin() and commit() to support cross-BTree
> transactions,

Absolutely. The current transaction support is less than half-baked: it
only supports per-BTree transactions, which is pretty much useless.

Actually, we do need to pass a context to each BTree operation, a context
that holds the revision of each B-tree taking part in the transaction.
That also means we have one common revision for each transaction, a
revision which ties together the revisions of all the BTrees that were
part of the transaction.

Typically, in an ApacheDS context, when we update an entry, we not only
update the master BTree, but also the RDN BTree and every BTree associated
with each index. And all those updates must be visible as a whole when
we want to fetch an entry or an index for this revision.

The solution would be to use the transaction ID (which will be
incremental) as the revision number for all the B-trees updated during
that transaction. That saves us the pain of keeping track of the
various B-tree revisions when applying an operation to a given revision
of a B-tree.
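A minimal, single-threaded sketch of that idea, assuming transactions commit one at a time in ID order and modeling every B-tree as an in-memory versioned map. `TxnStore`, `WriteTxn`, `begin()`, `commit()` and `get()` are hypothetical names, not Mavibot's API. Each write is tagged with the transaction ID, so one number gives a consistent view across the master, RDN and index B-trees.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical store where the transaction ID is the revision of every B-tree it touches. */
final class TxnStore {
    // B-tree name -> key -> (revision -> value)
    private final Map<String, Map<String, NavigableMap<Long, String>>> btrees = new HashMap<>();
    private long lastTxnId = 0;       // last ID handed out by begin()
    private long lastCommitted = 0;   // newest revision readers may see

    /** Explicit transaction buffering updates to any number of B-trees. */
    final class WriteTxn {
        final long id = ++lastTxnId;  // incremental ID, reused as the revision
        private final Map<String, Map<String, String>> pending = new HashMap<>();

        void put(String btree, String key, String value) {
            pending.computeIfAbsent(btree, b -> new HashMap<>()).put(key, value);
        }

        /** Publishes all buffered updates, in every B-tree, under revision `id`. */
        void commit() {
            pending.forEach((btree, updates) -> updates.forEach((key, value) ->
                btrees.computeIfAbsent(btree, b -> new HashMap<>())
                      .computeIfAbsent(key, k -> new TreeMap<>())
                      .put(id, value)));
            lastCommitted = id;
        }

        void rollback() {
            pending.clear();
        }
    }

    WriteTxn begin() {
        return new WriteTxn();
    }

    long lastCommittedRevision() {
        return lastCommitted;
    }

    /** Reads `key` from `btree` as of `revision`: newest value written at or before it. */
    String get(String btree, String key, long revision) {
        NavigableMap<Long, String> versions = btrees.getOrDefault(btree, Map.of()).get(key);
        if (versions == null) {
            return null;
        }
        Map.Entry<Long, String> entry = versions.floorEntry(revision);
        return entry == null ? null : entry.getValue();
    }
}
```

A reader that captures `lastCommittedRevision()` once and passes it to every `get()` sees either all of a transaction's updates (entry, RDN, indexes) or none of them.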
