lucene-java-user mailing list archives

From "Mike O'Leary" <>
Subject Question about basic indexing performance improvements
Date Sun, 18 Feb 2007 03:38:34 GMT
I am taking a class in which the professor has assigned a project: take a
question answering application that a team of students submitted to one of
the TREC contests last year and turn it into a teaching tool. One thing he
wants is the ability for students to create a variety of indexes with
different settings, so that they can observe how the choice of index changes
the results. The application searches a fixed set of just over a million
XML-formatted documents, so at this point there is no requirement for adding
or deleting documents. Because the team that created the application last
year only needed to index the collection once (after they figured out what
parameters they wanted), they didn't need to care very much that it took
around 30 hours to index the documents one by one using a single-threaded
indexing program.


Now we want to be able to index that same set of documents in much less
time. I am new to Lucene, so I am going only by what I have found so far in
the Lucene in Action book and on the internet. The section in the book on
indexing concurrency says that you can share an IndexWriter object among
several threads and that calls from those threads will be properly
synchronized. Will this in itself improve indexing performance very much? It
seems like the synchronization needed to keep the index from being corrupted
would limit how much you gain from using several threads. In any case, my
overall question is this: given an indexing task of this kind, where you
don't have to worry about additions, deletions and updates of the documents
being indexed, and you simply index the whole document collection as a batch
each time a user wants to index it in a different way, what would be the
fastest way to do it using the various Lucene indexing tools and features?
Thanks.

Mike O'Leary
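
To make the arrangement asked about above concrete, here is a minimal sketch
of several threads sharing one writer. It uses only the Java standard
library: SharedWriter, parse and indexAll are hypothetical names invented for
illustration, and SharedWriter is a stand-in for an IndexWriter, not Lucene's
actual API. The point is only the shape of the code: the per-document work
runs in parallel, and just the call that adds the document is serialized. It
does not show how much speedup, if any, a real index would see.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ParallelIndexSketch {

    // Hypothetical stand-in for a shared IndexWriter (NOT the Lucene class):
    // only the add itself is synchronized.
    static final class SharedWriter {
        private final List<String[]> added = new ArrayList<>();

        synchronized void addDocument(String[] tokens) {
            added.add(tokens);
        }

        synchronized int docCount() {
            return added.size();
        }
    }

    // Stand-in for per-document work (XML parsing, analysis). It runs
    // outside any lock, so several threads can do it at the same time.
    static String[] parse(String rawDocument) {
        return rawDocument.trim().toLowerCase().split("\\s+");
    }

    // Indexes every document using a fixed pool of worker threads that all
    // share one writer; returns the number of documents added.
    static int indexAll(List<String> rawDocuments, int threadCount) {
        SharedWriter writer = new SharedWriter();
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        for (String raw : rawDocuments) {
            pool.execute(() -> writer.addDocument(parse(raw)));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return writer.docCount();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "first sample document",
                "second sample document",
                "third");
        System.out.println("documents added: " + indexAll(docs, 2));
    }
}
```

In this shape, the gain from extra threads depends on how much of the total
time is spent in the parallel part (parse here) compared with the
synchronized add, which is the concern raised in the question.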
