directory-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
Subject svn commit: r1722613 - in /directory/site/trunk/content/mavibot: ./ user-guide/ user-guide/images/
Date Sat, 02 Jan 2016 08:38:10 GMT
Author: elecharny
Date: Sat Jan  2 08:38:10 2016
New Revision: 1722613

Updated the mavibot site with additional information in the user guide

    directory/site/trunk/content/mavibot/user-guide/images/R4-node.png   (with props)
    directory/site/trunk/content/mavibot/user-guide/images/r3tor4.gif   (with props)

Modified: directory/site/trunk/content/mavibot/user-guide.mdtext
--- directory/site/trunk/content/mavibot/user-guide.mdtext (original)
+++ directory/site/trunk/content/mavibot/user-guide.mdtext Sat Jan  2 08:38:10 2016
@@ -42,9 +42,7 @@ We are quite interested to improve the c
 * [1 - Introduction](user-guide/1-introduction.html)
     * [1.1 - B-tree basics](user-guide/1.1-btree-basics.html)
 *  [2 - B-tree Flavors](user-guide/2-btree-types.html)
-    * In-Memory
-    * Persistent
-    * Managed
+    * [2.1 - MVCC B-tree](user-guide/2.1-mvcc-btree.html)
 *  [3 - Mavibot B-tree management](user-guide/3-btree-management.html)
     * creation
     * close

Modified: directory/site/trunk/content/mavibot/user-guide/1-introduction.mdtext
--- directory/site/trunk/content/mavibot/user-guide/1-introduction.mdtext (original)
+++ directory/site/trunk/content/mavibot/user-guide/1-introduction.mdtext Sat Jan  2 08:38:10
@@ -26,3 +26,5 @@ The goal of this guide is to explain the
 We hope it will be enough for you to quickly get started, but in any case, if you feel like
improving this document, feel free to post your suggestion(s) to the Apache Directory mailing
list : any contribution is welcomed !
+The first two chapters is all about theory, you can skip thema nd jump to chapter 3 if you
want to get started quickly.

Modified: directory/site/trunk/content/mavibot/user-guide/1.1-btree-basics.mdtext
--- directory/site/trunk/content/mavibot/user-guide/1.1-btree-basics.mdtext (original)
+++ directory/site/trunk/content/mavibot/user-guide/1.1-btree-basics.mdtext Sat Jan  2 08:38:10
@@ -26,13 +26,15 @@ Notice: Licensed to the Apache Software
 A **B-tree** is a "tree data structure that keeps data sorted and allows searches, sequential
access, insertions, and deletions in logarithmic time." (see [Wikipedia](
-The important point here is the last one : it guarantees **O(logn)** operations, compared
to any other data structures (a hashed data structure offers **O(n)** average operations,
but can degenerate to **O(n2)**, and ordering is not kept. But there is more : by gathering
N elements in a page, which leads to  more comparisons than necessary, but still save a lot
of disk accesses, as those comparison will be done in memory when a page is fully loaded.
That's the reason for **B-trees** to have been invented. 
+The important point here is the last one : it guarantees **O(logn)** operations, compared
to any other data structures (a hashed data structure offers **O(n)** average operations,
but can degenerate to **O(n2)**, and ordering is not kept). Although it would be optimal to
keep only 2 elements in each node, it's way better from a performance point of view to keep
more than 2 elements in each node, even if it leads to more comparison than strictly necessary.
The reason is that reading a fixed size from disk is better than reading only what we need
(considering that a disk access is 3 to 4 orders of magnitude slower than RAM access, it's
always a good idea to cache pages in memory, rather than read data from the disk over and
over. And these 3 to 4 orders of magnitude is for SSD... For spinning disks, we are more in
the 4 to 5 orders of magnitude !)
+Add to that the fact that when *B-tree* were invented, data were stored on magnetic tapes,
with a pure sequential access. Reading data sequentially made sense, and reading a block of
data was the way to go - thus the page was meant to store more than 2 elements -.
 **B-trees** are everywhere : databases, OS, etc. It's a critical data structure when you
are to deal with a large set of data.
 1.1.1 - Inside a B-tree
-A **B-Tree** contains **Nodes** and **Leaves**. A *Node* points to other **Nodes** or **Leaves**.
Both **Nodes** and **Leaves** have **Keys** that are associated with *Values*. **Keys** are
not copied in many pages, they just appear once in the whole **B-tree**.
+A **B-Tree** contains **Nodes** and **Leaves**. A *Node* points to other **Nodes** or **Leaves**.
**Nodes** contain only **keys** and references to child **nNodes**, when **Leaves** have **Keys**
that are associated with **Values**. The **Nodes**' **Keys* are used to find the right child
that will ultimately contain the **Value** we are looking for (if it's present)
 Pretty simple !
@@ -42,11 +44,13 @@ A few more rules are enforced :
 * A **Node** and a **Leaf** contains up to N values (N being generally a power of 2, so 2,
4, 8, 16...).
 * You can't have less than N/2 **Values** or **keys** in a **Leaf** or a **Node**, except
for the root **Node**.
+The second rule allows the **B-tree** to have only one **Leaf**, when up to **N** elements
are to be stored. If we want to store one mor element, then the root **Leaf** will be split
and the elements will be spread equally in two **Leaves**, with a root **Node$* containing
references on those two **Leaves**.
 1.1.2 - Concurrent access
 The real problem with **B-tree**s is that we need to use some locks to allow concurrent access
to the data, especially when some updates occur. The rationale is that when browsing a **B-tree**,
changing it will just break the browse operation. There are some technics to avoid having
such an issue, up to a point no read locks are needed, assuming some extra information is
added to the pages : a pointer to the next page at the same level. Sadly, it comes with drawbacks,
like it's difficult to remove empty pages from the **B-tree**.
+What we want is a way to guarantee that many users can read data from the **B-tree**, even
while the **B-tree** is being updated. Of course, that will heavily favor reads over writes,
as writes will have to be serialized, but in many cases, this is a comfortable trade we are
ready to make.

Modified: directory/site/trunk/content/mavibot/user-guide/2-btree-types.mdtext
--- directory/site/trunk/content/mavibot/user-guide/2-btree-types.mdtext (original)
+++ directory/site/trunk/content/mavibot/user-guide/2-btree-types.mdtext Sat Jan  2 08:38:10
@@ -32,21 +32,21 @@ You have many different flavors of **B-t
 * MVCC B-tree
-## 2.1 - B+tree
+## B+tree
 This is a **B-tree** which does not store values in the **Nodes**, and a link between **Leaves**
is added, to speed up the browsing : no need to go up to the parent's node to get the next
value when reaching the end of a leaf. Also the **nodes** don't contain values.
-## 2.2 - B*tree
+## B*tree
 A slightly different form of **B+tree**, where the number of elements per page is enforced
to be at leat 2/3rd of the maximum numbers of elements the page can hold. It speeds up the
retrieval of elements a bit, as we have a denser tree.
-## 2.3 - Counted B-tree
+## Counted B-tree
 Another slightly different implementation of a **B+tree**, where the number of elements stored
in the descendants is stored within each key. This allows an easy count of elements on the
left and right to any element, at the price of additional piece of information being added
on each page.
-## 2.4 - MVCC B-tree
+## MVCC B-tree
-This is a new 'style' of **B+tree**, where the structure is exactly the same than a simple
**B+tree**, except that we keep old versions alive, until nobody use them. The idea is that
a new revision of the **B+tree** is added only when the updates are fully done. This has the
extremely intersting characteristic that there is no need of locks for read and writes, assuming
that writes are not concurrent (they are serialized).
+This is a new 'style' of **B+tree**, where the structure is exactly the same than a simple
**B+tree**, except that we keep old versions alive, until nobody use them. The idea is that
a new revision of the **B+tree** is added only when an update is fully done. This has the
extremely intersting characteristic that there is no need of locks for read and writes, assuming
that writes are not concurrent (they are serialized).
 In other words, you may have many reads at the same time, still being able to update the

Added: directory/site/trunk/content/mavibot/user-guide/2.1-mvcc-btree.mdtext
--- directory/site/trunk/content/mavibot/user-guide/2.1-mvcc-btree.mdtext (added)
+++ directory/site/trunk/content/mavibot/user-guide/2.1-mvcc-btree.mdtext Sat Jan  2 08:38:10
@@ -0,0 +1,141 @@
+Title: 2 - MVCC B-tree
+NavUp: ../user-guide.html
+NavUpText: User Guide
+NavNext: 3-btree-management.html
+NavNextText: 3 - Mavibot B-tree management
+NavPrev: 2-btree-types.html
+NavPrevText: 2 - B-tree flavors
+Notice: Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    .
+    .
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+# 2.1 - MVCC B-tree
+**MVCC** stands for **M**ulti **V**ersion **C**oncurrency **C**control. It all boils down
to allow readers to **ALWAYS** have a consistent version of the data they reached at the time
they started to work with them. In other words, when a reader access the database, it will
work on a specific version of it, that will never change.
+The drawback is that the data may be accurate at the time the reader started to work with
them, but this reader will never see the updated data if a writer has modified them.
+Most of the time, this is a price the users are ready to pay, for at least two reasons :
+- udpates are way less frequent that reads, and the users 'accept' the fact that what they
get is not necessarily up to date
+- there are some mechanisms that workaround this problem, checking afterward that the data
represent the latest state of the database, otherwise the operation is retried.
+You can read more about **MVCC** **B-tree** on [Wikipedia, MVCC](
+## How does it work ?
+A **MVCC** **B-Tree** is just a plain standard **B+Tree**, except that we keep a track of
more than one version of the **B+Tree**. Actually, we have multiple roots, one per version.
This can bee seen in this picture :
+![MVCC sample](images/R4-node.png)
+Here, we have 4 different versions, from r1 to r4. Each one of them might be used at the
same time by one or many readers. In any case, the writer update only the latest one, creating
a new revision doing so. In this exemple, the writer has just created r4 updating r3, adding
the element 'I'. As we can see, some pages are shared across the revisions : the leaf contaning
the element 'A' (version r3) is also referenced by the revision r4. This is a key : that means
a page (**Node** or **Leaf** may be used by many revisions).
+We don't delete old revision until they are not anymore in use. And even so, we don't delete
all the associated pages blindly : they may be referenced by newer versions that are still
+### Browsing the Database
+So we want to fetch some information from a B-Tree. That's just a matter of creating a *ReadTransaction*,
which will grab the latest existing revision for thi **B-Tree** and allow you to browse data
from this **B-Tree**. One you are done with the **B-Tree**, you just have to close the *ReadTransaction*
(this is critical, because keeping it opened will pin the revision, forbidding the removal
of old pages. Although if you forget to do it, the revision will be released after a period
of time). This is a description of the typical code :
+    open a ReadTransaction
+    get the B-Tree you want to read
+    read the B-tree
+    ...
+    close the ReadTransaction
+It's important to note that the transaction covers all the access on all the managed **B-Tree**.
There is no need to open a new *ReadTransaction* if you want to fetch some other information
from another **B-Tree**. That also mean you can be sure that you have a consistent view of
all the database for a given revision.
+You may have many threads reading data from many **B-Trees**, and we will have as many *ReadTransaction*
created for that purpose. As we may keep a *readTransaction* opened for quite a long period
of time, it's likely that at some point, the various Threads will use different versions of
the **B-Trees**.
+### Updating the Database
+This is a bit more complex. Here, you need a *WriteTransaction*, and there can be only one
running across all the application. That means the access to the Database is serialized, when
it comes to update it. The direct consequence is that writes are **slow**. 
+Again, the typical call is like :
+    open a WriteTransaction
+    get the B-Tree you want to update
+    update the B-tree
+    ...
+    close the WriteTransaction
+and again, many **B-Trees** might be updated during this *writeTransaction*. For a system
like a **LDAP** server, this is critical, because updating an entry is impacting many **B-trees**,
and we want all of them to be consistent globally : eitehr all the **B-trees** are updated,
or none of them.
+It's even more critical to close the *writeTransaction* because all the other writers are
waiting for this transaction to be completed to be able to proceed. 
+Let's now describe how an update works, because it's quite convoluted.
+#### Updates, from the tranchees...
+Updating a **B-Tree** is not a simple operation. Actually, we want to guarantee that the
operation succeed or fail, but in any case, leave the database in a consistant state. We also
want to guarantee that even if a crash occurs in the middle of a write will leave the database
in a consistant state. Let's see how it's possible.
+A **B-Tree** is composed of 3 elements :
+    * The *BTreeHeader*, which contains the reference on the two other elements
+    * The *BTreeInfo*, which contains informations relative to this **B-Tree** (this element
is never updated)
+    * The *RootPage*, which is the root of the **B-Tree**
+Each of these elemenst are stored into *Pages*, and referenced through the offset of those
*Pages* on disk (the offset is a long, and it's just the position of a page in the database
+The very first step is to retrieve the latest **B-Tree** revision. We have a management **B-Tree**
that keeps a track of all the active revisions for every **B-Tree** : the **BtreeOfBtrees**,
called **BOB**. It stores a reference on each **B-tree**'s header for every active revision.
The key is a composition of the **B-Tree** name and its revision : *<name, revision>
and it reference the *BTreeHeader*'s offset (the position of the *BtreeHeader*'s page). Fo
that, we browse the *BOB* looking for the newest <Name, Revision> couple (actually,
we are looking for the first tuple that is lower that <name, infinite>).
+Once we have found the tuple, we know where to fetch the *BtreeHeader* from. So do we. We
can then fetch the *RootPage*, which is the starting point for any update. 
+At this point, the update is either an addition or a removal. That will potentially change
the whole structure of the **B-Tree**, or a single page. In any case, we will have a new *RootPage*
pointing to old pages and to newly created pages. Let's see again the previous picture :
+![MVCC sample](images/R4-node.png)
+Let's say we are at revision 3, and we want to add an entry 'I' : as we can see, the *rootPage*
is updated, the right **Leaf** is split in two, and the left **Leaf** is unchanged. All in
all, we have created 3 new pages, and 2 old pages are now unused and can be reclaimed later.
The internal modification of pages is not in the scope of this explaination, so we will leave
it for the moment.
+Here is an animated image that shows how it's done :
+![R3 to R4](images/r3tor4.gif)
+So now, we have a new *BtreeHeader*, revision 4. We have to keep a track of it, which means
we have to inject it into the **BOB**. As it's a **B-Tree**, we proceed with it the exact
same way, except that we always have a reference on the **BOB** *BtreeHeader* in the transaction
+<DIV class="note" markdown="1">
+We will see later on what is the Transaction Context
+We are almost done. We also have to keep a track of the 'old' pages that were copied. In
order to be able to reclaim them, we have to store them somewhere convenient. An 'old' page
is not necessarily 'old' for everyone : another thread might have a reader on revision 3,
thus might need those 'old' pages. Actually, reclaiming 'old' pages is a bit complex and will
be explained later on. Enough said that we want to keep a track of every of those pages in
a **B-Tree**, to be able to reclaimn them later.
+We use a specific **B-Tree** for taht : the **CopiedPageBtree**, or **CPB**. The key is different
than for the **BOB**, it's a composition of the revision we are creating, and the name of
the **B-Tree** being updated : <revision, name>. The value is the offset of the copied
page. This way, when the revision is not anymore in use by any reader, we can reclaim the
associated pages.  
+The **CPB** is a **B-Tree**, so the update process is the one we already have described.
There is one big difference though : once the **CPB** has been updated, we can immediately
reclaim the copied pages, we won't use them at all (this is because there is only one writer
thread, and this is the only one allowed to manipulate this **B-Tree**).
+We are almost done now. At this point, the targetted **B-Tree** has been updated, and the
**BOB** and **CPB** have also been updated. But no reader can see those updated versions yet.
We need to update the **RecordManagerHeader**, a structure that contain the reference to the
current **BOB**. Once we have updated this structure, and wrote it on disk, we can safely
make the new **BOB** available for any new readers.
+#### Reclaiming unused pages
+As we have seen in the previous paragraph, we always create new pages, we never remove any.
This is obviously going to eat a lot of disk space if we don't reclaim unused pages !
+Hopefully, there is a *reclaimer* thread that is executed when no writerer is running, which
role is to check if some old pages can be removed. There are strict conditions for a page
to be candidate for removal :
+    * It must be presnet in the **CPB**
+    * The revision that has copied those pages must not be in use by any reader
+    * The revision must be the oldest one
+The latest condition is the critical one : it means that if a revision is used for a long
period of time, we will not be able to reclaim any pages from this revision nor from any younger
revision, even if they are not in use. That may bloat the Database...
+There are ways to deal with this problem. Typically, if a page is copied twice, then we can
reclaim it if it's not referenced. The algorithm is a bit complex though...
+The easiest solution is simply to forbid long read transactions.
+### Dealing with alive revisions
+Every time we have a reader starting a *readTransaction*, we have to keep a track of the
current revision, and the number of readers using it. This is done using a specific List of
Used Revisions. Every new reader will add a new revision to this list, or increment a counter
associated with the existing revision. Everytime a reader close the *readTransaction*, the
counter is decremented. When this counter go down to 0, the revision can be discarded, unless
it's the latest one.
+Discarding an old revision is possible only when it's the oldest. That means the list may
contain many inactive revisions, with a counter equal to 0. Those inactive revisions will
be removed when they will become the oldest revision.
\ No newline at end of file

Added: directory/site/trunk/content/mavibot/user-guide/images/R4-node.png
Binary file - no diff available.

Propchange: directory/site/trunk/content/mavibot/user-guide/images/R4-node.png
    svn:mime-type = application/octet-stream

Added: directory/site/trunk/content/mavibot/user-guide/images/r3tor4.gif
Binary file - no diff available.

Propchange: directory/site/trunk/content/mavibot/user-guide/images/r3tor4.gif
    svn:mime-type = application/octet-stream

View raw message