lucene-java-user mailing list archives

From Andrzej Bialecki
Subject Re: Performance guarantees and index format
Date Tue, 05 Feb 2008 02:45:02 GMT
Chris Hostetter wrote:
> : What this issue doesn't discuss is what to do with partial results obtained
> : when a timeout occurred. As the original poster points out, document lists are
> : traversed in the order they were added and not the order of their importance,
> : which introduces a bias to partial results in that they reflect results from a
> : random sample (which is likely not the most relevant, i.e. there could have
> : been more relevant results later in the traversal order).
> : 
> : The answer to this issue is org.apache.nutch.indexer.IndexSorter, which
> skimming this it doesn't seem like a refactored version that was less 
> nutch specific could make a handy contrib ... but it also seems like there 
> may be a simpler approach for the (i assume) common case of preferring docs 
> that were indexed later....
> if we eliminate the requirement for *strict* preference of recent 
> documents and make that a more loose desire, then couldn't we do a 
> pretty good job if we just changed Segment merging to reverse the 
> order of the segments before each merge?  it wouldn't be very useful to 
> start doing this on an index that's already a decent size, but if this was 
> happening on every merge right from the very beginning, then the most 
> recent documents would percolate to the front of the index right?
> The only downside i can think of would be that docids would frequently 
> (and not very predictably) change even if there were no deletions .. but 
> you'd pay that same penalty with something like nutch's IndexSorter.
> I'm not much of an expert on segment merging.. but with the exception of 
> docid order i can't think of many reasons why there couldn't be a merger 
> that reversed the order of the segments.

I think this would be too messy - currently we can rely on the simple 
rule that documents added to the index get incrementally higher docids, 
i.e. the higher the docid, the more recent the document. I think it 
would be much simpler to write a FilterIndexReader that simply reverses 
the order of docids.
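
To make the reversal idea concrete, here is a rough sketch of the docid 
arithmetic such a wrapper would apply. It is plain Java and does not use 
Lucene's FilterIndexReader; the class and method names (ReversedDocIds, 
flip, reversePostings) are made up for illustration. A real reader wrapper 
would presumably also have to return each term's postings in ascending 
order of the new docids, which this sketch does by materialising them in 
an array.

```java
// Hypothetical sketch only: illustrates reversed docid numbering,
// not an actual Lucene FilterIndexReader subclass.
final class ReversedDocIds {
    private final int maxDoc; // number of docids in the wrapped index

    ReversedDocIds(int maxDoc) {
        if (maxDoc < 0) {
            throw new IllegalArgumentException("maxDoc must be >= 0");
        }
        this.maxDoc = maxDoc;
    }

    /** Maps a docid between the two numberings; the mapping is its own inverse. */
    int flip(int docId) {
        if (docId < 0 || docId >= maxDoc) {
            throw new IndexOutOfBoundsException("docId out of range: " + docId);
        }
        return maxDoc - 1 - docId;
    }

    /**
     * Takes postings in ascending original docid order and returns them
     * in ascending reversed docid order (most recently added first).
     */
    int[] reversePostings(int[] ascendingDocIds) {
        int n = ascendingDocIds.length;
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            // Walk the input backwards so the flipped ids come out ascending.
            out[i] = flip(ascendingDocIds[n - 1 - i]);
        }
        return out;
    }
}
```

With maxDoc = 10, the most recently added document (docid 9) becomes 
docid 0, so a search that is cut off early sees the newest documents first.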

The point of Nutch's IndexSorter is that it allows you to reorder 
docids in an arbitrary manner, using a user-supplied mapping between old 
and new docids, which can be based on values retrieved from the current 
index or from any other source. So I think this is the main value 
of this class, and it is applicable to various scenarios.
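
As an illustration of what such an old-to-new mapping might look like, 
here is a small sketch that derives one from a per-document score, so 
that the highest-scoring documents get the lowest new docids. This is not 
the IndexSorter API; DocIdRemapper, oldToNewByScore and remapPostings are 
hypothetical names, and the score array stands in for whatever value you 
pull from the index or from another source.

```java
// Hypothetical sketch only: shows an old->new docid permutation,
// not the actual org.apache.nutch.indexer.IndexSorter code.
final class DocIdRemapper {

    /**
     * Returns oldToNew, where oldToNew[oldId] is the new docid.
     * The highest score gets docid 0; ties keep their original order.
     */
    static int[] oldToNewByScore(float[] scoreByOldId) {
        Integer[] order = new Integer[scoreByOldId.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Arrays.sort on objects is stable, so equal scores stay in old docid order.
        java.util.Arrays.sort(order,
                (a, b) -> Float.compare(scoreByOldId[b], scoreByOldId[a]));

        int[] oldToNew = new int[order.length];
        for (int newId = 0; newId < order.length; newId++) {
            oldToNew[order[newId]] = newId;
        }
        return oldToNew;
    }

    /** Rewrites one postings list to the new numbering and restores ascending order. */
    static int[] remapPostings(int[] oldDocIds, int[] oldToNew) {
        int[] out = new int[oldDocIds.length];
        for (int i = 0; i < oldDocIds.length; i++) {
            out[i] = oldToNew[oldDocIds[i]];
        }
        java.util.Arrays.sort(out);
        return out;
    }
}
```

Reversing the docid order is then just the special case where the 
mapping is maxDoc - 1 - oldId.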

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
