lucene-java-user mailing list archives

From "Rob Staveley (Tom)"
Subject RE: Sorting
Date Wed, 02 Aug 2006 09:38:17 GMT
> Scorers are by contract expected to score docs in docId order

This was my missing link. Now it makes sense to me to use a buffered
RandomAccessFile and not bother with the presort.

Many thanks, Chris, that was very well explained. 

I'll have a crack at a lean-memory SortComparatorSource implementation,
which uses a buffered RandomAccessFile, as described.

-----Original Message-----
From: Chris Hostetter
Sent: 02 August 2006 04:32
Subject: RE: Sorting

: I'm with you now. So you do seeks in your comparator. For a large index
: might as well use for the "array", because there
: would be little value in buffering when the comparator is liable to jump

yep .. that's what i was getting at ... but i'm not so sure that buffering
won't be useful.  If i'm not mistaken, all Scorers are by contract
expected to score docs in docId order so when your hits are being collected
for sorting, you should always be moving forward in the file
-- but you may skip ahead a lot when the result set isn't a high percentage
of the total number of docs.
(i may be wrong about all Scorers going in docId order ... if you explicitly
use the 1.4 BooleanScorer you may not get that behavior, but i think
everything else works that way ... perhaps someone else can verify)
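
A minimal sketch of the buffered, forward-moving lookup described above. It is
not Lucene code: the class name BufferedSortValues, the file layout (one 8-byte
long per docId at offset docId * 8) and the helper writeValues are all
assumptions made for illustration. When docIds arrive in increasing order, one
buffer fill serves every hit that falls in the same window, and a skip past the
window costs one seek.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

/** Hypothetical helper: sort values stored as one big-endian long per docId. */
class BufferedSortValues implements AutoCloseable {
    private static final int BYTES_PER_VALUE = 8;

    private final RandomAccessFile file;
    private final byte[] buffer;
    private long bufferStart = 0; // file offset of buffer[0]
    private int bufferLength = 0; // number of valid bytes currently buffered
    private int refills = 0;      // how many times we had to seek and read

    BufferedSortValues(File path, int valuesPerBuffer) {
        if (valuesPerBuffer < 1) {
            throw new IllegalArgumentException("valuesPerBuffer must be >= 1");
        }
        try {
            this.file = new RandomAccessFile(path, "r");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.buffer = new byte[valuesPerBuffer * BYTES_PER_VALUE];
    }

    /** Returns the sort value for docId, refilling the buffer only on a miss. */
    long valueFor(int docId) {
        long offset = (long) docId * BYTES_PER_VALUE;
        boolean inBuffer = offset >= bufferStart
                && offset + BYTES_PER_VALUE <= bufferStart + bufferLength;
        if (!inBuffer) {
            refill(offset);
        }
        return ByteBuffer.wrap(buffer, (int) (offset - bufferStart), BYTES_PER_VALUE).getLong();
    }

    /** Seeks to offset and reads forward until the buffer is full or the file ends. */
    private void refill(long offset) {
        try {
            file.seek(offset);
            int total = 0;
            while (total < buffer.length) {
                int n = file.read(buffer, total, buffer.length - total);
                if (n < 0) break; // end of file
                total += n;
            }
            if (total < BYTES_PER_VALUE) {
                throw new IllegalArgumentException("docId is beyond the end of the file");
            }
            bufferStart = offset;
            bufferLength = total;
            refills++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int refillCount() {
        return refills;
    }

    @Override
    public void close() {
        try {
            file.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Illustration only: writes the values to a temp file in the assumed layout. */
    static File writeValues(long[] values) {
        try {
            File f = File.createTempFile("sortvalues", ".bin");
            f.deleteOnExit();
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(f)))) {
                for (long v : values) out.writeLong(v);
            }
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a 16-value buffer, looking up docIds 0, 3 and 15 costs a single read,
while jumping ahead to docId 40 triggers one more seek. How large the buffer
should be depends on how sparse the result set is, which this sketch does not
try to tune.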

: around the file. This sounds very expensive, though. If you don't open a
: Searcher too frequently, it makes sense (in my muddled mind) to pre-sort to
: reduce the number of seeks. That was the half-baked idea of the third
: which essentially orders document IDs.

presort on what exactly, the field you want to sort on?  -- That's
essentially what the TermEnum is.  I'm not sure how having that helps you ...
let's assume you've got some data structure (let's not worry about the
file/ram or TermEnum distinction just yet) containing every document in your
index of 100,000,000 products sorted on the price field, and you've done a
search for "apple" and there are 1,000,000 docIds for matching products
ready to be collected by your new custom Scoring code ... how does the full
list of all docIds sorted by price help you as you are given docIds and have
to decide if that doc is better or worse than the docs you've already
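
To make the question above concrete, here is a rough sketch of what collecting
the best N hits looks like when docIds are handed over one at a time. It is
plain Java, not Lucene's collector code; the class CheapestHits and the array
priceByDocId are hypothetical stand-ins. The only thing the loop ever asks is
"what is the price of this docId?", which is a lookup keyed by docId, not a
walk over a list sorted by price.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration of top-N collection by a per-document sort value. */
class CheapestHits {
    /**
     * Returns the docIds of the n lowest-priced matches, cheapest first.
     * priceByDocId stands in for any docId-to-value lookup (array, file, cache).
     */
    static List<Integer> cheapest(int[] matchingDocIds, long[] priceByDocId, int n) {
        List<Integer> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Order by price, breaking ties by docId so the outcome is deterministic.
        Comparator<Integer> byPrice = (a, b) -> {
            int c = Long.compare(priceByDocId[a], priceByDocId[b]);
            return c != 0 ? c : Integer.compare(a, b);
        };
        // Reversed order makes the head of the queue the worst hit kept so far.
        PriorityQueue<Integer> kept = new PriorityQueue<>(byPrice.reversed());
        for (int docId : matchingDocIds) {
            if (kept.size() < n) {
                kept.add(docId);
            } else if (byPrice.compare(docId, kept.peek()) < 0) {
                kept.poll();     // drop the current worst
                kept.add(docId); // keep the better newcomer
            }
        }
        result.addAll(kept);
        result.sort(byPrice);
        return result;
    }
}
```

Each incoming docId costs one value lookup and at most one queue update, so
memory stays proportional to N rather than to the number of matches.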

: > Bear in mind, there have been some improvements recently to the ability
: to grab individual stored fields per document....
: I can't see anything like that in 2.0. Is that something in the Lucene
: build?

I guess so ... search the java-dev archives for "lazy field loading" or
"Fieldable" .. that should find some of the discussion about it and the jira
issue with the changes.

