directory-dev mailing list archives

From Emmanuel Lécharny <>
Subject Bulk Load
Date Fri, 30 May 2014 09:35:12 GMT
Hi guys,

Kiran's work on bulk load is making huge progress. On my computer, I was
able to load 30 000 entries in around 16s, and the last modification he
is working on (i.e., making the LdifReader schema aware) will bring that
down to 15s. That is 2000 entries added per second, with all the indexes
created (10 of them). Here is the result of a run:

Loading schema using JarLdifSchemaLoader
Sorting the LDIF data...
Using dc=example,dc=com as the partition's root DN
Completed sorting, total number of entries 30002, time taken : 2462ms
Creating partition..., time taken : 174ms
Building master table...replacing old offset 17920 of the BTree master
with 70266368
, time taken : 10145ms
Building RDN index.replacing old offset 13312 of the BTree with 92644864
replacing old offset 14848 of the BTree with 113890304
, time taken : 2223ms
Clearing the sorted DN set.
Building index objectClassreplacing old offset 92649472 of the BTree
6f5011bf-c0cb-46fa-ba92-4834b0a1124f with 115933184
replacing old offset 115934720 of the BTree
6c8b2473-abf9-4ca5-bb6c-a6ab99566026 with 117976576
replacing old offset 117978112 of the BTree
e6b700bc-9c37-4436-8f18-ff7a05323f30 with 120019968
replacing old offset 10240 of the BTree with 120020992
, time taken : 252ms
Building index entryCSNreplacing old offset 1536 of the BTree with 124040192
, time taken : 142ms
Building index administrativeRole, time taken : 13ms
Building index apacheOneAlias, time taken : 18ms
Building index apacheSubAlias, time taken : 12ms
Building index apacheAlias, time taken : 10ms
Building presence index...replacing old offset 11776 of the BTree with 120022528
, time taken : 15ms
Patition building complete in 16167ms

So the most expensive operations are the MasterTable creation (about two
thirds of the time), the RDN index creation, and the entry sorting.

All three of those operations manipulate DNs extensively. We can
greatly improve the performance by speeding up the DN parsing here. How?

Every DN is read as a String, then parsed fully. This is most
certainly wasteful, as most of the DNs will share the same parent. We can
keep the parsed parents in a cache and spare the cost of parsing them
again, and again, and again. For instance, if we read
cn=test,dc=example,dc=com, we are very likely to have already parsed
dc=example,dc=com. This can easily be checked by testing whether the
current DN ends with dc=example,dc=com. If this is the case, we can then
split the DN in two parts, the child part and the parent part, and only
parse the child part.
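
To make the idea concrete, here is a minimal sketch of it. It is not the ApacheDS code: ParentCachingDnParser and everything inside it are hypothetical names, an "RDN" is just a trimmed String, and the "parser" is a naive split on unescaped commas that ignores quoted values. It is also a small variation on the suffix check described above: instead of testing endsWith against each known parent, it cuts the DN at the first unescaped comma and looks the remaining parent String up in a bounded LRU map (the 1024 limit is an arbitrary assumption).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a toy DN parser that reuses already parsed parents. */
final class ParentCachingDnParser {
    // Arbitrary bound so the cache cannot grow with every parent ever seen.
    private static final int MAX_CACHED_PARENTS = 1024;

    // LRU cache (access-ordered LinkedHashMap): parent DN string -> parsed RDNs.
    private final Map<String, List<String>> parentCache =
        new LinkedHashMap<String, List<String>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
                return size() > MAX_CACHED_PARENTS;
            }
        };

    // Counters, only here to observe the effect of the cache.
    int parentParses = 0;
    int cacheHits = 0;

    /** Returns the RDNs of the DN, leftmost first. */
    List<String> parse(String dn) {
        int comma = firstUnescapedComma(dn, 0);
        if (comma < 0) {
            return parseFully(dn); // single RDN, there is no parent to reuse
        }
        String parent = dn.substring(comma + 1);
        List<String> parentRdns = parentCache.get(parent);
        if (parentRdns == null) {
            // First time we meet this parent: parse it once and remember it.
            parentRdns = Collections.unmodifiableList(parseFully(parent));
            parentCache.put(parent, parentRdns);
            parentParses++;
        } else {
            cacheHits++;
        }
        // Only the child part is processed here; the parent RDNs are reused.
        List<String> rdns = new ArrayList<>(parentRdns.size() + 1);
        rdns.add(dn.substring(0, comma).trim());
        rdns.addAll(parentRdns);
        return rdns;
    }

    /** Stand-in for the real, expensive parser: splits on unescaped commas. */
    private static List<String> parseFully(String dn) {
        List<String> rdns = new ArrayList<>();
        int start = 0;
        int comma;
        while ((comma = firstUnescapedComma(dn, start)) >= 0) {
            rdns.add(dn.substring(start, comma).trim());
            start = comma + 1;
        }
        rdns.add(dn.substring(start).trim());
        return rdns;
    }

    /** Index of the first comma at or after 'from' not preceded by a backslash escape, or -1. */
    private static int firstUnescapedComma(String s, int from) {
        for (int i = from; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == ',') {
                return i;
            }
        }
        return -1;
    }
}
```

With this sketch, parsing cn=test,dc=example,dc=com and then any other entry directly under dc=example,dc=com parses the parent only once. Since the cache key is the exact parent String, a parent written with different case or spacing simply misses the cache and gets parsed again, which is slower but still correct.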

I think that could save a lot of processing time. Now, we have to be
careful, because we can't do that for every single RDN we have already
parsed. At least, checking for the naming context could help.

Btw, I see that we are parsing the DN 181 000 times (15% of the time
spent). We can most certainly avoid most of those parsings. Here is where
we parse the DN:
- when we parse the LdifEntry (30 000 times)
- in the FastLdifReader (30 000 times)
- when we parse the creatorsName attribute (it's a DN, and we do check
its syntax) (30 000 times)
And some other places. Here is the stack trace:
List) 4976 181712, List) 3160 90856
1588 30002 1571
 844 30852
743 30002<init>(SchemaManager,
String[]) 1816 90856<init>(String[]) 1475
1475 60854
698 30002
341 30002

So there is some room for improvement in this area :-)
