directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Emmanuel Lécharny <>
Subject Re: Bulk Loader : some ideas...
Date Sun, 17 Aug 2014 17:35:15 GMT
Le 17/08/14 17:07, Howard Chu a écrit :
> Emmanuel Lécharny wrote:
>> Hi guys,
>> sorry for having been absent the last few weeks (side project ...).
>> I have spend part of the last two days reviewing the bulkLoader tool.
>> Currently, on my machine, I'm able to process 30 000 entries in less
>> than 20 seconds (this is around 1600 added entries per second). It's
>> fats, compared to a direct injection of entries in a running LDAP
>> server, but we can do way better.
>> Here are some thoughts :
>> - first of all, we need to process the DN before being able to inject
>> entries into the master table. This is required because we have to
>> inject the ParentID attribute into each entry, which requires we have a
>> complete hierarchy available. Here, we have two options :
>> 1) read the entry's DN from the LDIF file first, ignoring the values. We
>> have a FastLdifReader class that does read the DN, and pass through the
>> other elements. We assume that we have enough memory to hold all the DNs
>> we will read (which could be a limitation when we try to bulkload tens
>> of million entriues). Once done, we can associate an ID to each RDN, and
>> get ready to inject those ID into the entries, which will be done while
>> creating the MasterTable. What if we don't have enough memory ? Well,
>> case 2
>> 2) As we may not have enough memory to hold all the DNs, we have to do
>> an external sort. Ie, we load a limited number of DNs, and once we have
>> reached the limit, we sort what we get, and save the sorted result on
>> disk. Once we have gone through al the DNs, we ends with N files that
>> are all atomically sorted, but need to be gathered. Thsi is done in a
>> second phase, by fetching the first Dn from each of the sorted files,
>> until we have no more DN to read from any file.
>> The second option will obviously be more costly, but it's guaranteed to
>> work no matter how much memory we have. And if we are lucky to have
>> enough memory to store all the elements in memory, we fail back to the
>> first option.
>> What we keep in memory is not only the DN, but the EntryUUID, so that we
>> can store it into each entry later (this will be the parentId
>> attribute).
>> At the same time, we need to gather all the entryUUID, as we need them
>> sorted to be able to create the masterTable (we have the same constraint
>> than for DNs, ie, if we don't have enough memory, then we need to create
>> temporary files). We keep a position in the original file to be able to
>> fetch the entries directly.
>> So, bottom line, at the end of this first phase, we will have a sorted
>> list of UUID and an hierarchically sorted set of DNs. If we are lucky,
>> they are in memory, and if we are very lucky (or if we don't have a lot
>> of entries to process, even the entries will be in memory).
>> - once this first phase is done, we can build the MasterTable : we just
>> have to get the entryUUID from the sorted list of entryUUID, and create
>> the masterTable from it. It's just a matter of fetching the entry from
>> disk using its position, parse it and store it, once we have added the
>> parentId and the other required system attributes (entryCSN,
>> creatorsName, and creationTime).
>> - at the very same time, we need to create the other indexes. For the
>> same reason (memory limitation), we may have to go through intermediate
>> files and process an external sort. In any case, we can delegate those
>> index creations to a dedicated thread, to benefit from the multiple
>> cores a computer has, speeding up the processing. The algorithm is no
>> different than teh one we used to create the matser table : we store in
>> memory the value associated with the entryUUID it is related to, and
>> once done, we push them into a new Table.
>> Side not though : for indexes, we may have multiple entryUUID associated
>> with a value (typical case for ObjectClass index).
>> I'm not sure we can do any better to imrpove the performances. Using
>> cache for the MasterTable should not be useful, if we can't store all
>> the entries in memory.
>> thgoughts ?
> Using two different approaches is gratuitous complexity. Using
> multiple passes and external sorting is unnecessary.
> UUIDs are generally not a sortable value, or not meaningful anyway.
> That seems like a lot of wasted effort.
Every index are using UUID to refer to entries in the MasterTable. If
it's not sorted, we ends up with full scan when we want to grab an entry
using an index.

A long time ago, we used longs to refer to entries, but we decided to
switch to UUIDs, because the entryUUID was already present in each entry
anyway. We just get rid of this spurious long, which now is the key of
the masterTable BTree. It has to be ordered, otherwise all the indexes
would be meaningless.

That may be different for OpenLDAP.

As a side note, the fact that we were using an incremental counter when
we set an ID to each entry (a long) was also due to the very nature of
the backend we were using back then (JDBM) for which we had no access to
the offset of entries on disk. Using this incremental counter forced us
to update the backend once more everytime we had to add a new entry,
which was a PITA.
> I explained OpenLDAP's approach once before but will summarize it again:
> It's a single-pass loader, and it works regardless of the entry order
> of the input LDIF. For each entry, we examine its DN and recursively
> lookup all of its parent DNs in the RDN index, from top down. For any
> missing RDN in this lookup, we create the RDN index entry anyway, and
> generate an entryID for it at that time. We also keep an in-memory
> list of missing DNs.
> If we encounter an entry later in the LDIF that corresponds to one of
> these missing DNs, the search in the RDN index will just return the
> entryID we already assigned to it. We then remove the DN from the
> missing DN list. The result is that the DB tables and entryIDs are
> generated in DN order even if the entries aren't ordered in the LDIF.

The pb with this approach is that you lose the EntryUUID stored in the
LDIF file (typically when you try to bulk load an extract done from a
replica : you want to keep this information).

We could spare this step if the LDIF file is actually ordered, but there
is no guarantee.
> At the end of the run, any slots left in the missing DN list are
> printed out in a warning message. You can still use either the bulk
> loader or ldapadd to supply these entries later.
> E.g., given an LDIF with these entries
> dn: dc=example,dc=com
> dn: cn=joe,ou=people,dc=example,dc=com
> dn: cn=foo,ou=groups,dc=example,dc=com
> dn: ou=groups,dc=example,dc=com
> dc=example,dc=com gets entryID 1.
> cn=joe,ou=people,dc=example,dc=com causes ou=people,dc=example,dc=com
> to be indexed with entryID 2, marked missing. cn=joe gets entryID 3.
> cn=foo,ou=groups,dc=example,dc=com causes ou=groups,dc=example,dc=com
> to be indexed with entryID 4, marked missing. cn=foo gets entryID 5.
> ou=groups,dc=example,dc=com finds that it's already got entryID 4,
> gets removed from missing DN list.
> ou=people,dc=example,dc=com is printed as missing. Its entryID remains
> reserved for it and will be used when it gets properly added.
> There's no redundant parsing of the LDIF required, no multiple passes
> of processing. You simply use the RDN index that you're already
> generating to keep the entry DNs properly sorted. There's no need for
> external sorts and no need to worry about what fits in memory (aside
> from the missing DN list, which you could also maintain as an external
> table if you really wanted to, but in a proper LDIF there should not
> be very many DNs in this list anyway).

That works only because the entryID are meaningless numeric values. We
ruled out that in 2008, which is one of the reason we have to go through
those two phases.

View raw message