directory-dev mailing list archives

From Howard Chu
Subject Re: [Mavibot] BulkLoad
Date Fri, 20 Jun 2014 16:16:20 GMT
Emmanuel Lécharny wrote:
> On 20/06/2014 14:50, Howard Chu wrote:
>> Emmanuel Lécharny wrote:
>>> Hi guys,
>>> many thanks, Kiran, for the OOM fix!
>>> That's one step toward fast loading of a big database.
>>> The next steps are also critical. We are currently limited by the memory
>>> size, as we store the DNs we load in memory. In order to go one step
>>> further, we need to implement a system where we can process an LDIF file
>>> with no limitation due to the available memory.
>>> That supposes we process the LDIF file in chunks, and once the chunks are
>>> sorted, we process them as a whole, pulling one element from each
>>> of the sorted lists of DNs and picking the smallest to inject into the
>>> BTree.
>> Why do you store the DNs in memory? Why are you sorting them?
> We need to build the RDN index, which contains the ParentIDandRDN data
> structure, where each element is a tuple with the parentID and the
> current RDN. That means we must have seen the parent before we can deal
> with the children. This is why we keep the DNs in memory.

Sure, the OpenLDAP backends require this too. But reading the LDIF twice is ugly.

In our bulk load we simply look up the parent DN in the database/index to
see if it exists yet, and if not, we (recursively) generate the parentID(s)
and add them to the index. We also keep an in-memory list of such missing
DNs for display at the end of the bulk load.

Later, if the parent entry is actually found in the input, we simply store it,
using the ID that was previously generated. (Looking it up from the RDN index
is fast.) Then we remove it from the list of missing DNs.

In short: pretend you've already seen the parent, even if you haven't.
Don't worry about it until you reach the end of the input; only then do you
know for sure that it's really missing.
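
As a rough illustration of the scheme described above, here is a minimal
Python sketch. It is not the OpenLDAP code: the names BulkLoader, parse_dn
and ROOT_ID are hypothetical, a plain dict stands in for the on-disk RDN
index, and DNs are split naively on commas (no escaped commas, every DN
component treated as its own node).

```python
from itertools import count

ROOT_ID = 0  # hypothetical ID for the (virtual) root above all entries


def parse_dn(dn):
    # Naive split, leaf RDN first. Assumes no escaped commas in RDN values.
    return tuple(part.strip().lower() for part in dn.split(",") if part.strip())


class BulkLoader:
    """Toy single-pass loader; a dict stands in for the real RDN index."""

    def __init__(self):
        self.rdn_index = {}   # (parent_id, rdn) -> entry ID
        self.entries = {}     # entry ID -> entry attributes
        self.missing = set()  # DNs whose ID was generated before the entry was seen
        self._ids = count(1)

    def _id_for(self, rdns, placeholder):
        # Resolve the ancestors first, generating IDs for any not yet indexed.
        if not rdns:
            return ROOT_ID
        parent_id = self._id_for(rdns[1:], placeholder=True)
        key = (parent_id, rdns[0])
        entry_id = self.rdn_index.get(key)
        if entry_id is None:
            entry_id = next(self._ids)
            self.rdn_index[key] = entry_id
            if placeholder:
                # Pretend the parent exists; remember it for the final report.
                self.missing.add(rdns)
        return entry_id

    def add(self, dn, attrs):
        rdns = parse_dn(dn)
        # Reuses the previously generated ID if this DN was a placeholder.
        entry_id = self._id_for(rdns, placeholder=False)
        self.entries[entry_id] = attrs
        self.missing.discard(rdns)
        return entry_id


loader = BulkLoader()
child_id = loader.add("cn=alice,ou=people,dc=example", {"cn": "alice"})
parent_id = loader.add("ou=people,dc=example", {"ou": "people"})
# Anything still in loader.missing at end of input never appeared in the LDIF.
still_missing = sorted(",".join(rdns) for rdns in loader.missing)
```

In this toy run the child arrives before its parent, the parent later picks
up the ID that was generated for it, and only the DN that never appears in
the input is left in the missing set.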

Never do the same work twice. The DB already maintains DNs in sorted order,
so there's no need to explicitly sort in the bulk load tool.

   -- Howard Chu
   CTO, Symas Corp. 
   Director, Highland Sun
   Chief Architect, OpenLDAP
