From Chris Hostetter <>
Subject RE: Sorting
Date Tue, 01 Aug 2006 08:37:10 GMT

: I take your point that Berkeley DB would be much less clumsy, but an
: application that's already using a relational database for other purposes
: might as well use that relational database, no?

if you already have some need to access data about each matching doc from
a relational DB, then sure, you might as well let it sort for you -- but
just because your app has some DB connections open doesn't mean that's a
worthwhile reason to ask it to do the sort ... your app might have some
network connections open to an IMAP server as well ... that doesn't mean
you should convert the docs to email messages and ask the IMAP server to
sort them :)

: I'm not really with you on the random access file, Chris. Here's where I am
: up to with my [mis-]understanding...
: I want to sort on 2 terms. Happily these can be ints (the first is an INT
: corresponding to a 10 minute timestamp "YYMMDDHHI" and the second INT is a
: hash of a string, used to group similar documents together within those 10
: minute timestamps). When I initially warm up the FieldCache (first search
: after opening the Searcher), I start by generating two random access files
: with int values at offsets corresponding to document IDs for each of these;
: the first file would have ints corresponding to the timestamp and the second
: would have integers corresponding to the hash. I'd then need to generate a
: third file which is equivalent to an array dimensioned by document ID, with
: document IDs in compound sort order??

i'm not sure why you think you need the third file ... you should be
able to use the two files you created exactly the way the existing code
would use the two arrays if you were using an in-memory FieldCache (with
file seeks instead of array lookups) ... i think the class you want to
look at is FieldSortedHitQueue
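
to make the "file seeks instead of array lookups" idea concrete, here is a
rough sketch in plain Java (no Lucene classes involved). the class name
DiskIntCache and its put/get/compare methods are made up for illustration,
they are not part of Lucene; the only assumption is that each file holds one
4-byte int per document at offset docId * 4. the compare method orders two
docs by timestamp first and hash second, reading straight from the two
files, so no third pre-sorted file is needed.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical disk-backed stand-in for an int[] indexed by document ID. */
public class DiskIntCache implements AutoCloseable {
    private final RandomAccessFile file;

    public DiskIntCache(Path path) throws IOException {
        this.file = new RandomAccessFile(path.toFile(), "rw");
    }

    /** Store the value for docId at byte offset docId * 4. */
    public void put(int docId, int value) throws IOException {
        file.seek((long) docId * Integer.BYTES);
        file.writeInt(value);
    }

    /** Seek to the doc's slot and read its value (the "array lookup"). */
    public int get(int docId) throws IOException {
        file.seek((long) docId * Integer.BYTES);
        return file.readInt();
    }

    @Override
    public void close() throws IOException {
        file.close();
    }

    /** Compound ordering: timestamp first, hash as the tie-breaker. */
    public static int compare(DiskIntCache timestamps, DiskIntCache hashes,
                              int docA, int docB) throws IOException {
        int c = Integer.compare(timestamps.get(docA), timestamps.get(docB));
        if (c != 0) {
            return c;
        }
        return Integer.compare(hashes.get(docA), hashes.get(docB));
    }

    public static void main(String[] args) throws IOException {
        Path tsFile = Files.createTempFile("timestamps", ".bin");
        Path hashFile = Files.createTempFile("hashes", ".bin");
        try (DiskIntCache ts = new DiskIntCache(tsFile);
             DiskIntCache hs = new DiskIntCache(hashFile)) {
            // Made-up values: doc 0 and doc 1 share a 10 minute bucket.
            ts.put(0, 60801083); hs.put(0, 7);
            ts.put(1, 60801083); hs.put(1, 3);
            ts.put(2, 60801082); hs.put(2, 9);
            System.out.println("doc0 vs doc1: " + compare(ts, hs, 0, 1));
            System.out.println("doc2 vs doc1: " + compare(ts, hs, 2, 1));
        } finally {
            Files.deleteIfExists(tsFile);
            Files.deleteIfExists(hashFile);
        }
    }
}
```

a comparison like this would be called once per pair of hits that the
priority queue needs to order, so every comparison costs up to four seeks;
that is where the slowdown relative to in-memory arrays comes from.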

: In a big index, it will take a while to walk through all of the documents to
: generate the first two random access files and the sort process required to
: generate the sorted file is going to be hard work.

well ... yes.  but that's the tradeoff: the reason for the RAM based
FieldCache is speed ... if you don't have that RAM to use, then doing
the same things on disk is going to be slower.

Bear in mind, there have been some improvements recently to the ability to
grab individual stored fields per document (FieldSelector is the name of
the class i think) ... i haven't tried those out yet, but they could make
sorting on a stored field (which wouldn't require building up any cache,
RAM or disk based) feasible regardless of the size of your result sets.


