lucene-java-user mailing list archives

From "David Elworthy"
Subject Speed of indexing
Date Mon, 25 Mar 2002 13:08:42 GMT
I was wondering if there are tricks for making indexing faster in
Lucene. I have a program which reads XML "documents" from a file, and
indexes the 7 or so fields which occur in them. Most of the fields are
very short, and the one long one averages a few hundred words.

To index 20000 such records takes 615 seconds. I use an IndexWriter with
a String as the first argument, i.e. indexing directly to disc. If I
change the mergeFactor to 100, the time drops to 275 seconds. At 1000,
it drops to 249 seconds. These times are not bad in absolute terms, but the
20000 records represent only about 2% of my data, so indexing the whole
lot takes many hours. Using java -Xprof and mergeFactor=10, the biggest
consumers of processing time are:
 22.2%     5  + 13172
 16.1%     4  +  9567
 13.3%     4  +  7880
  8.1%     5  +  4818
  7.2%  4293  +     9
  5.8%     5  +  3426

I believe all of these are calls from Lucene as I don't use any of the
above methods in my own code. readBytes and writeBytes I can believe,
but why so much time on open and close? Incidentally, with
mergeFactor=1000, the biggest consumers are:
 29.7%     0  +  6729
 19.0%  4296  +    12

As a point of comparison, I tried AltaVista's Java SDK (Nov 2000
release). I have a generic indexer program which differs only in the
specific indexing calls for AV and Lucene. For the same 20000 records,
it took only 57 seconds. This, I feel, does not sit well with Doug's
comment in the Lucene FAQ that indexing in Lucene is very fast. If
anyone has ideas for making it faster, I'd be interested to hear them.
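
For a rough sense of scale, the timings above can be extrapolated to the
full data set. The sketch below is only back-of-envelope arithmetic: it
assumes the 20000 records are exactly 2% of the data and that indexing
time grows linearly with the number of records, neither of which I have
verified. The class and method names are made up for the illustration.

```java
public class IndexingEstimate {
    // Sample size and timings are the figures quoted in this message.
    static final int SAMPLE_RECORDS = 20000;
    // Assumption: "about 2%" is treated as exactly 2%.
    static final double SAMPLE_FRACTION = 0.02;

    // Indexing throughput observed on the sample.
    static double recordsPerSecond(double sampleSeconds) {
        return SAMPLE_RECORDS / sampleSeconds;
    }

    // Projected time for the whole data set, assuming linear scaling.
    static double fullRunHours(double sampleSeconds) {
        return sampleSeconds / SAMPLE_FRACTION / 3600.0;
    }

    public static void main(String[] args) {
        String[] labels = {
            "Lucene, mergeFactor=10",
            "Lucene, mergeFactor=100",
            "Lucene, mergeFactor=1000",
            "AltaVista SDK"
        };
        double[] seconds = {615, 275, 249, 57};

        for (int i = 0; i < labels.length; i++) {
            System.out.printf("%-26s %4.0f s  %6.1f records/s  ~%.1f h for all data%n",
                    labels[i], seconds[i],
                    recordsPerSecond(seconds[i]), fullRunHours(seconds[i]));
        }
    }
}
```

Under those assumptions the full run comes to roughly 8.5 hours at
mergeFactor=10 and about 3.5 hours at mergeFactor=1000, against well
under an hour for the AltaVista SDK.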

-- David Elworthy
