lucene-java-user mailing list archives

From Ilya Zavorin <>
Subject Design qs: search for multiple terms in document collection
Date Thu, 01 Dec 2011 16:49:09 GMT
I am trying to make some high-level (and not so high-level) design decisions for an app that is
supposed to check a collection of documents against a set of terms/queries. Basically, I need
to perform a triage of sorts: find only those docs in the collection that contain at least one
occurrence of a term from the term list. For those docs, I also need to find where
in the document each occurrence is, because I then need to collect a small amount of surrounding
text for a more detailed analysis.
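
To make the goal concrete, here is a minimal plain-Java sketch of the triage step for a single
document. It does not use Lucene at all; the class name TermTriage, the Hit type, and the
window parameter are made up for illustration, and matching is a naive case-insensitive
substring scan rather than anything analyzer-aware.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: scans one document for a list of terms.
final class TermTriage {

    /** One occurrence of a term: its start offset and the text around it. */
    static final class Hit {
        final String term;
        final int start;
        final String context;

        Hit(String term, int start, String context) {
            this.term = term;
            this.start = start;
            this.context = context;
        }
    }

    /**
     * Returns every case-insensitive occurrence of every term in the text,
     * together with up to `window` characters of context on each side.
     * An empty result means the document fails the triage.
     */
    static List<Hit> findHits(String text, List<String> terms, int window) {
        List<Hit> hits = new ArrayList<>();
        for (String term : terms) {
            if (term.isEmpty()) {
                continue; // an empty term would match at every offset
            }
            for (int i = 0; i + term.length() <= text.length(); i++) {
                if (text.regionMatches(true, i, term, 0, term.length())) {
                    int from = Math.max(0, i - window);
                    int to = Math.min(text.length(), i + term.length() + window);
                    hits.add(new Hit(term, i, text.substring(from, to)));
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String doc = "The quick brown fox jumps over the lazy dog";
        for (Hit h : findHits(doc, Arrays.asList("fox", "cat", "the"), 4)) {
            System.out.println(h.term + " @" + h.start + ": [" + h.context + "]");
        }
    }
}
```

This is roughly the output I am after per document: a list of (term, offset, context) entries,
with an empty list meaning the document can be skipped.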

Clearly, I will need to index the document collection using the indexing classes of Lucene. This
is pretty straightforward.

Then I will need to use the highlighting classes. In some sample code I found online, a query
is first searched for and hits are returned. Then the docids are extracted for the hits and the
query is highlighted. Some questions:

Q1: Does Lucene perform essentially the same searching operation twice, first to find hits,
then to highlight? If so, does this mean that if I expect most of the docs in my collection
to contain at least one of the search terms, it might be faster for me to skip searching and
simply go over all docs, applying highlighting? Then for those docs where no hits occurred
I would simply get an empty list of relevant fragments. 

Q2: Is the same scoring mechanism used during search and during highlighting? That is, can
I be sure that if I get a hit during search, the corresponding document indeed contains my
query, which will then be found during highlighting?

Q3: Are there any mechanisms in Lucene that would facilitate merging of highlighting results
for two different queries against a single document? 

Q4: I did some small tests of highlighting and noticed that some of the fragments returned
for a query contained highlighted text that was quite far from the original query. For instance,
I was looking for a 3-word term and it highlighted a sequence of only 2 of these 3 words.
How can I control how close highlighted fragments should be to the original query?
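
To illustrate what I mean by merging in Q3: if nothing built in exists, I assume I could merge
the results myself along the lines of the plain-Java sketch below. It is not Lucene API. The
class name SpanMerge is made up, and it assumes each highlighted fragment is available as a
{start, end} character-offset pair with an exclusive end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Lucene: unions the highlight spans of two queries.
final class SpanMerge {

    /**
     * Merges two lists of {start, end} spans (end exclusive) into one sorted
     * list in which overlapping or touching spans are fused into a single span.
     * The input lists and arrays are not modified.
     */
    static List<int[]> merge(List<int[]> first, List<int[]> second) {
        List<int[]> all = new ArrayList<>(first);
        all.addAll(second);
        all.sort(Comparator.comparingInt((int[] s) -> s[0]));

        List<int[]> merged = new ArrayList<>();
        for (int[] span : all) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && span[0] <= last[1]) {
                // Overlaps or touches the previous span: extend it.
                last[1] = Math.max(last[1], span[1]);
            } else {
                // Copy so the caller's arrays are never mutated.
                merged.add(new int[] {span[0], span[1]});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<int[]> queryA = Arrays.asList(new int[] {0, 5}, new int[] {10, 15});
        List<int[]> queryB = Arrays.asList(new int[] {3, 8}, new int[] {20, 25});
        for (int[] span : merge(queryA, queryB)) {
            System.out.println(Arrays.toString(span));
        }
    }
}
```

Each merged span could then be widened by a fixed number of characters to collect the
surrounding text once per region instead of once per query.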

Thanks much,

Ilya Zavorin
