directory-dev mailing list archives

From Emmanuel Lécharny
Subject Re: [Mavibot] BulkLoad
Date Fri, 20 Jun 2014 16:38:04 GMT
On 20/06/2014 18:16, Howard Chu wrote:
> Emmanuel Lécharny wrote:
>> On 20/06/2014 14:50, Howard Chu wrote:
>>> Emmanuel Lécharny wrote:
>>>> Hi guys,
>>>> many thanks Kiran for the OOM fix !
>>>> That's one step toward fast loading of big databases.
>>>> The next steps are also critical. We are currently limited by the
>>>> memory size, as we store the DNs we load in memory. To go one step
>>>> further, we need to implement a system where we can process an LDIF
>>>> file with no limitation due to the available memory.
>>>> That supposes we process the LDIF file in chunks, and once the
>>>> chunks are sorted, we process them as a whole, pulling one element
>>>> from each of the sorted lists of DNs and picking the smallest to
>>>> inject into the BTree.
>>> Why do you store the DNs in memory? Why are you sorting them?
>> We need to build the RDN index, which contains ParentIdAndRdn data
>> structures, where each element is a tuple with the parentID and the
>> current RDN. That means we must have seen the parent before we can deal
>> with the children. This is why we keep the DNs in memory.
> Sure, the OpenLDAP backends require this too. But reading the LDIF
> twice is ugly.

This is why I suggested not to do so.
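
To make the chunked proposal quoted above concrete, here is a minimal
sketch of the merge step only. It assumes the chunks are already sorted
and that the keys are plain strings in a form where a parent sorts
before its children (for instance a DN written root-first). ChunkMerger
and Cursor are hypothetical names, not Mavibot classes, and in-memory
lists stand in for the sorted chunk files.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: k-way merge of already-sorted chunks of keys.
final class ChunkMerger {
    // One cursor per chunk: its current smallest element and the remainder.
    private static final class Cursor implements Comparable<Cursor> {
        final String head;
        final Iterator<String> rest;

        Cursor(String head, Iterator<String> rest) {
            this.head = head;
            this.rest = rest;
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    // Repeatedly picks the smallest head among all chunks.
    static List<String> merge(List<List<String>> sortedChunks) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> chunk : sortedChunks) {
            Iterator<String> it = chunk.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it.next(), it));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            out.add(smallest.head); // a real loader would inject into the BTree here
            if (smallest.rest.hasNext()) {
                heap.add(new Cursor(smallest.rest.next(), smallest.rest));
            }
        }
        return out;
    }
}
```

Only one element per chunk is held in the heap at any time, so memory
use depends on the number of chunks rather than on the size of the LDIF
file.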
> In our bulk load we simply look up in the database/index to see if the
> parent DN exists yet, and if not, we (recursively) generate the
> parentID(s) and add them to the index. 
But you can't then bulk load the ParentID index.

> We also keep an in-memory list of such missing DNs for display at the
> end of the bulk load.
> Later if the parent entry is actually found in the input, we simply
> store it, using the ID that was previously generated. (Looking it up
> from the RDN index is fast.) Then remove it from the list of missing DNs.
> In short - pretend you've already seen the parent, if you haven't
> actually. Don't worry about it until you reach the end of the input,
> then you know for sure it's really missing.
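
The approach described in the quoted text could be sketched roughly as
follows. This is an illustration under stated assumptions, not the
OpenLDAP code: PlaceholderLoader is a hypothetical name, a HashMap
stands in for the database/index lookup, and the parent DN is found by
cutting at the first comma, which ignores escaped commas.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: assign IDs to parents before they are seen in the input.
final class PlaceholderLoader {
    private final Map<String, Long> idByDn = new HashMap<>(); // stand-in for the index
    private final Set<String> missing = new LinkedHashSet<>(); // DNs not yet seen in the input
    private long nextId = 1;

    // Returns the ID for dn, generating IDs for dn and any unseen ancestors.
    private long idFor(String dn) {
        Long known = idByDn.get(dn);
        if (known != null) {
            return known;
        }
        int comma = dn.indexOf(','); // simplification: no escaped commas
        if (comma >= 0) {
            idFor(dn.substring(comma + 1)); // make sure the parent has an ID first
        }
        long id = nextId++;
        idByDn.put(dn, id);
        missing.add(dn);
        return id;
    }

    // Called for each entry read from the input; reuses a previously generated ID.
    long addEntry(String dn) {
        long id = idFor(dn);
        missing.remove(dn); // the entry has now really been seen
        return id;
    }

    // Whatever remains at the end of the input is really missing.
    Set<String> missingDns() {
        return missing;
    }
}
```

For example, adding "ou=people,dc=example" before "dc=example" gives the
parent its ID first, and the missing list is empty once both entries
have been read.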
There is one important thing: we do need the RDN index (a
ParentIdAndRdn index, in fact) to be ordered, because we use it when
processing one-level or sub-tree searches. That allows us to fetch a
limited number of candidates.
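
As an illustration of why the ordering matters, here is a minimal
sketch of a one-level lookup over an ordered (parentId, RDN) index. It
assumes keys are ordered by parentId first and then by RDN, and uses a
TreeMap as a stand-in for the BTree. RdnIndexSketch and its
ParentIdAndRdn class are simplified, hypothetical versions, not the
actual ApacheDS classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: ordered (parentId, rdn) index with a range scan.
final class RdnIndexSketch {
    static final class ParentIdAndRdn {
        final long parentId;
        final String rdn;

        ParentIdAndRdn(long parentId, String rdn) {
            this.parentId = parentId;
            this.rdn = rdn;
        }
    }

    // Assumed ordering: parentId first, then RDN.
    static final Comparator<ParentIdAndRdn> ORDER = (a, b) -> {
        int c = Long.compare(a.parentId, b.parentId);
        return c != 0 ? c : a.rdn.compareTo(b.rdn);
    };

    private final NavigableMap<ParentIdAndRdn, Long> index = new TreeMap<>(ORDER);

    void put(long parentId, String rdn, long entryId) {
        index.put(new ParentIdAndRdn(parentId, rdn), entryId);
    }

    // One-level search: all children of parentId sit in one contiguous key range.
    List<Long> children(long parentId) {
        ParentIdAndRdn from = new ParentIdAndRdn(parentId, "");
        ParentIdAndRdn to = new ParentIdAndRdn(parentId + 1, "");
        return new ArrayList<>(index.subMap(from, true, to, false).values());
    }
}
```

Because the keys are ordered, the candidates for one parent are found
with a single range scan instead of a walk over the whole index.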
