lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrzej Bialecki>
Subject Re: Performance guarantees and index format
Date Thu, 31 Jan 2008 19:29:23 GMT
Mark Miller wrote:

What this issue doesn't discuss is what to do with partial results 
obtained when a timeout occurred. As the original poster points out, 
document lists are traversed in the order they were added and not the 
order of their importance, which introduces a bias to partial results in 
that they reflect results from a random sample (which is likely not the 
most relevant, i.e. there could have been more relevant results later in 
the traversal order).

The answer to this issue is org.apache.nutch.indexer.IndexSorter, which 
rearranges posting lists in an already existing index, according to an 
arbitrary measure of document importance, so that documents with smaller 
id are more "important". Then the partial hit lists obtained as a result 
of early termination will have a better chance of being more relevant.

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message