directory-dev mailing list archives

From Emmanuel Lécharny
Subject Bulk load profiling
Date Sun, 22 Jun 2014 15:26:03 GMT
Hi Kiran,

I did a bit of profiling today, and was able to improve performance by 7%.
The method I sped up is PrepareString. I created a specific method
which does not create a new char[] when we are dealing with ASCII chars
only. The gain is huge.
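
To illustrate the idea, here is a minimal sketch of an ASCII fast path. It is not the actual PrepareString code: the class and method names are hypothetical, and it only covers Unicode normalization (the real method may also fold case and handle spaces). It relies on the fact that pure ASCII text is already in NFKC form, so the input can be returned without allocating anything.

```java
import java.text.Normalizer;

// Hypothetical sketch, not the Apache Directory implementation.
final class AsciiFastPath {
    /** Scans the string in place; no char[] copy is made. */
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    /**
     * Stand-in for the string preparation step (normalization only).
     * ASCII input is already normalized, so it is returned as-is;
     * anything else goes through the general, allocating path.
     */
    static String prepare(String s) {
        if (isAscii(s)) {
            return s; // fast path: same instance, zero allocation
        }
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }
}
```

With this shape, the common case (ASCII-only values) costs one scan and no garbage, while non-ASCII values keep the full behaviour.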

Otherwise, most of the time is -as expected- spent in the
deserialization of entries read from the MasterTable.

At this point, I think we should think about what we can do to avoid
such cost. Most of the time, we will have enough memory to load all the
elements that will be stored into an index. I'm wondering if it would
not be better to parse the LDIF once, gather what we can in memory (but
not keeping the whole entry in memory) and build the index directly,
then process the master table.
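
As a rough sketch of what a single pass could look like: the code below reads a simplified LDIF stream once and keeps only the values of one attribute, together with the position of the entry they came from. Everything here is an assumption made for illustration (the class name, using the entry position as an identifier, a TreeMap as the in-memory index), and it ignores base64 values, folded lines and comments.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: gather index data in one pass, without keeping entries.
final class SinglePassIndexer {
    /** Returns a sorted map: attribute value -> positions of the entries holding it. */
    static Map<String, List<Integer>> gather(Reader ldif, String attribute) {
        Map<String, List<Integer>> index = new TreeMap<>();
        String prefix = attribute.toLowerCase(Locale.ROOT) + ":";
        int entryNumber = 0;
        boolean inEntry = false;
        try {
            BufferedReader reader = new BufferedReader(ldif);
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    // A blank line closes the current entry.
                    if (inEntry) {
                        entryNumber++;
                        inEntry = false;
                    }
                    continue;
                }
                inEntry = true;
                if (line.toLowerCase(Locale.ROOT).startsWith(prefix)) {
                    // Only the value is kept; the rest of the entry is dropped.
                    String value = line.substring(prefix.length()).trim();
                    index.computeIfAbsent(value, k -> new ArrayList<>()).add(entryNumber);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return index;
    }
}
```

Because the map is sorted, its content could be written out in key order once the pass is over, which is what building the index directly would need.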

It's not easy, because we can't know how many elements we can store in
memory, and when we reach the memory limit, we have to do something
completely different. If we decide to deal with the memory
limitation from the beginning, we will pay the price and it will be
expensive. OTOH, most of the time we won't have to care about memory,
for two reasons:
- either we have to deal with a limited number of entries in the ldif file
- or we have enough memory to handle the whole file (on my computer, I
can provide 14 GB to the JVM, enough to process 5M entries if each one of
them is 1 KB large)

I'm now thinking that it would be better to have two possible algorithms:
- an in-memory one, which does not care about what could happen when we
run out of memory
- a 'smarter' one which takes control when we get an OOM

This can be done the same way we do with the DN parser: we have a fast
parser, which throws an exception if it sees a special case, and a full
parser. Same here, but we catch the OOM instead.
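
The fast/full pattern described above could be sketched like this. The class name and the use of Supplier are assumptions made for the example, not existing bulk-load code: the in-memory loader is tried first, and the memory-bounded one takes over if an OutOfMemoryError is thrown.

```java
import java.util.function.Supplier;

// Hypothetical sketch of the "fast path first, full path on failure" idea.
final class BulkLoadStrategy {
    /**
     * Runs the in-memory loader; if the JVM runs out of heap,
     * restarts the work with the memory-bounded loader.
     */
    static <T> T loadWithFallback(Supplier<T> inMemory, Supplier<T> memoryBounded) {
        try {
            return inMemory.get();
        } catch (OutOfMemoryError oom) {
            // Once the stack has unwound, the partial structures built by the
            // fast path are unreachable and can be collected before the retry.
            return memoryBounded.get();
        }
    }
}
```

One caveat: recovering from an OOM is not guaranteed to be clean. Other threads may hit the same error at the same moment, so this probably works best when the bulk load is the only thing running in the JVM.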

Of course, we can probably try to 'predict' which one to use when we
start the bulk load, to avoid spending time with the in-memory process
when it is going to fail. Or we can let the user decide.
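
A very rough way to make such a prediction is to compare an estimate of the data size with the heap given to the JVM. The sketch below is only an illustration: the class name is hypothetical, and the overhead factor (how much bigger the data gets once it is turned into Java objects) is a guess that would have to be measured.

```java
// Hypothetical sketch: decide up front whether the in-memory algorithm is worth trying.
final class LoaderChooser {
    /** True when the estimated footprint fits in the given heap size. */
    static boolean fitsInMemory(long entryCount, long avgEntryBytes,
                                long heapBytes, double overheadFactor) {
        double needed = (double) entryCount * avgEntryBytes * overheadFactor;
        return needed <= heapBytes;
    }

    /** Same check against the maximum heap of the running JVM. */
    static boolean fitsInMemory(long entryCount, long avgEntryBytes, double overheadFactor) {
        return fitsInMemory(entryCount, avgEntryBytes,
                Runtime.getRuntime().maxMemory(), overheadFactor);
    }
}
```

With the numbers from above (5M entries of 1 KB, 14 GB of heap), the raw data is about 5 GB: an overhead factor of 2 still fits, while a factor of 3 does not. So the answer depends a lot on that factor, which is one more reason to keep the OOM fallback or let the user decide.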

Wdyt ?
