Hi guys,
Kiran just committed a first shot at the Mavibot bulk load feature.
It's working, at least for a limited set of data (we tested it with
30,000 entries generated with MakeLDIF).
There is still a lot to do:
- when trying to load more than 30K entries, we get an OOM. Growing the
heap is just not the right approach here
- sub-btrees are not bulk loaded, resulting in a long time to build
indexes with duplicate values (like ObjectClass)
- the process parses each entry twice
- some code cleanup is needed...
So far, the biggest issue is the first one. All the others are minor
(and I think Kiran is already working on them).
The problem is that we won't be able to store all the read entries in
memory, so we have to preprocess them in some way. Some constraints have
to be taken into account:
- when we inject an entry into the MasterTable, we have to add the
entryCSN and entryUUID if they aren't already present
- we also have to add the parentID to each entry, which means we have to
know which entry is the parent
- the RDN index is also built in a specific order, which means we have
to order the entries
- we also have to order every indexed attribute if we want to build the
associated btrees
Most of those issues would be easy to deal with if we could sort the
various elements before starting to build the btrees.
One idea is to parse the entries from X to X+N, where N is an arbitrary
limit (it can be adaptive), and create sorted files for each of the
elements (ie entries, and indexes). Once all the files are created (and
we may have many), we will just have to process them by reading the
first entry of each file, and picking the smallest one to inject into
the associated btree. For instance:
data = G P D T N O S R B Y K
we can only deal with 4 of them in memory, so we produce 3 sorted files:
f1 : D G P T
f2 : N O R S
f3 : B K Y
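Producing those sorted runs can be sketched like this (a toy Java sketch working on strings; the class and method names are mine, not actual Mavibot code, and a real implementation would flush each run to a temporary file instead of keeping it as a list):

```java
import java.util.*;

public class SortedRuns
{
    /**
     * Splits the input into chunks of at most maxInMemory elements,
     * sorts each chunk in memory, and returns the sorted runs (on disk
     * each run would be a temporary file).
     */
    public static List<List<String>> makeRuns( List<String> data, int maxInMemory )
    {
        List<List<String>> runs = new ArrayList<>();

        for ( int i = 0; i < data.size(); i += maxInMemory )
        {
            // Copy the chunk out of the input, sort it, and keep it as one run
            List<String> run = new ArrayList<>(
                data.subList( i, Math.min( i + maxInMemory, data.size() ) ) );
            Collections.sort( run );
            runs.add( run );
        }

        return runs;
    }

    public static void main( String[] args )
    {
        List<String> data = Arrays.asList(
            "G", "P", "D", "T", "N", "O", "S", "R", "B", "Y", "K" );

        // Prints [[D, G, P, T], [N, O, R, S], [B, K, Y]] : f1, f2, f3
        System.out.println( makeRuns( data, 4 ) );
    }
}
```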
Now, we can create the index using the 3 files in parallel :
1) min(f1, f2, f3) > f3 B, f3{K Y}
2) min(f1, f2, f3) > f1 D, f1{G P T}
3) min(f1, f2, f3) > f1 G, f1{P T}
4) min(f1, f2, f3) > f3 K, f3{Y}
5) min(f1, f2, f3) > f2 N, f2{O R S}
6) min(f1, f2, f3) > f2 O, f2{R S}
7) min(f1, f2, f3) > f1 P, f1{T}
8) min(f1, f2, f3) > f2 R, f2{S}
9) min(f1, f2, f3) > f2 S, f2{}
10) min(f1, f2, f3) > f1 T, f1{}
11) min(f1, f2, f3) > f3 Y, f3{}
As you can see, we are pulling the data in the right order from each of
the files to generate a sorted index.
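The merge step above can be sketched in Java with a priority queue holding the current head of each run, so that min(f1, f2, f3) is a single poll (again a hypothetical sketch, not Mavibot code; runs are in-memory lists here where the real thing would read from files):

```java
import java.util.*;

public class KWayMerge
{
    /** One cursor per sorted run: the current head value plus the rest. */
    private static class Cursor implements Comparable<Cursor>
    {
        String head;
        Iterator<String> rest;

        Cursor( Iterator<String> it )
        {
            rest = it;
            head = it.next();
        }

        public int compareTo( Cursor other )
        {
            return head.compareTo( other.head );
        }
    }

    public static List<String> merge( List<List<String>> runs )
    {
        // The heap orders the runs by their current head element
        PriorityQueue<Cursor> heap = new PriorityQueue<>();

        for ( List<String> run : runs )
        {
            if ( !run.isEmpty() )
            {
                heap.add( new Cursor( run.iterator() ) );
            }
        }

        List<String> merged = new ArrayList<>();

        while ( !heap.isEmpty() )
        {
            // Pick the smallest head among all the runs...
            Cursor min = heap.poll();
            merged.add( min.head );

            // ...and put the run back with its next element, if any is left
            if ( min.rest.hasNext() )
            {
                min.head = min.rest.next();
                heap.add( min );
            }
        }

        return merged;
    }

    public static void main( String[] args )
    {
        List<List<String>> runs = Arrays.asList(
            Arrays.asList( "D", "G", "P", "T" ),   // f1
            Arrays.asList( "N", "O", "R", "S" ),   // f2
            Arrays.asList( "B", "K", "Y" ) );      // f3

        // Prints [B, D, G, K, N, O, P, R, S, T, Y] : steps 1 to 11 above
        System.out.println( merge( runs ) );
    }
}
```

Each element is pushed and polled once, so merging M elements from K runs costs O(M log K), regardless of how many runs we ended up with.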
Of course, it has to be done for each index, which means we will have
many sets of files (one set per index).
All in all, I don't see any better way, but this is why I want your input!
