lucy-user mailing list archives

From Peter Karman <pe...@peknet.com>
Subject Re: [lucy-user] Hits offset and search performance
Date Mon, 12 Nov 2012 16:46:52 GMT
Thomas den Braber wrote on 11/12/12 3:53 AM:
> On Sun, Nov 11, 2012 at 04:19 AM, Marvin Humphrey <marvin@rectangular.com> wrote:
> 
>> I don't know how Swish-e implements sorting of hits, but this is expected
>> behavior in Lucy.
> 
> Swish-e can use presorting of attributes during indexing:
> 'By default Swish-e generates presorted tables while indexing for each property name.
> This allows faster sorting when generating results. On large document collections this
> presorting may add to the indexing time, and also adds to the total size of the index.
> This directive can be used to customize exactly which properties will be presorted.'
> 
> Maybe this does the trick ?


Swish-e does presort attributes, but rank/score is not one of them. That is
always a per-search attribute.

ISTR an email exchange about this back when I was first using KinoSearch
(pre-Lucy), but I can't find it now.



> 
>>> I would expect that using the offset, performance should be higher because
>>> no processing needs to be done to the hits before the offset (no score
>>> calculation).
> 
>> How do you know that the hit number 5000 actually ranks 5000th in sort order
>> unless you calculate scores for all documents and perform sorting?


Swish-e calculates the score for all documents before sorting them. Just like Lucy.


> 
>> There are certain times when Lucy can avoid calculating scores -- when
>> SortSpecs do not require scores, or when documents match pure negative clauses
>> (docs matching "bar" in the query `foo AND NOT bar`).  But when you are
>> ranking documents based on score, we have to calculate a score for **every**
>> document.
> 
> Sorry I didn't mention this but I really meant sorting by attributes other than the score,
> like modification date or file size. Is calculating the score also needed here?


No. If you look at the source for SWISH::Prog::Lucy::Searcher->search() you will
see that I always add a SortRule for 'score' but that is only so that I can show
the result, not to sort by it.
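
For anyone using Lucy directly, here is a minimal sketch of sorting by an
attribute while still getting a score to display. The index path and the
'mtime' field are hypothetical, and the field is assumed to have been indexed
as sortable. The trailing 'score' rule is my assumption about how to keep
scores available; it only breaks ties and is not copied from
SWISH::Prog::Lucy::Searcher.

```perl
use strict;
use warnings;
use Lucy::Search::IndexSearcher;
use Lucy::Search::SortRule;
use Lucy::Search::SortSpec;

# Hypothetical index location; 'mtime' is a hypothetical sortable field.
my $searcher = Lucy::Search::IndexSearcher->new( index => '/path/to/index' );

my $sort_spec = Lucy::Search::SortSpec->new(
    rules => [
        # Primary order: newest documents first.
        Lucy::Search::SortRule->new( field => 'mtime', reverse => 1 ),
        # Secondary rule: score is only used to break ties between equal mtimes.
        Lucy::Search::SortRule->new( type => 'score' ),
    ],
);

my $hits = $searcher->hits(
    query      => 'foo',
    sort_spec  => $sort_spec,
    num_wanted => 10,
);

# Print the sort field and the score for each returned hit.
while ( my $hit = $hits->next ) {
    printf "%s  %0.3f\n", $hit->{mtime}, $hit->get_score;
}
```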



> 
>> I would assume that Swish-e and Lucy are implemented differently.  I don't
>> know what seek() does in the context of Swish-e.
> 
> Seek will fast forward through the search result without first specifying the total hits
> you want to collect and without reading the results that exist before the seek pointer.
> In swish you also do not have to say in advance how many hits you want.


$hits->seek(10); # skip the first 9 hits

This is similar to the Lucy::Index::Lexicon->seek() method.

It would be useful to have it for Lucy::Search::Hits too, imo.
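
In the meantime, the closest thing in Lucy is the offset parameter to hits().
A rough sketch, assuming a hypothetical index path and query. Note that this
only gives you the same window of results: as discussed earlier in the thread,
Lucy still has to process the hits that come before the offset.

```perl
use strict;
use warnings;
use Lucy::Search::IndexSearcher;

# Hypothetical index location.
my $searcher = Lucy::Search::IndexSearcher->new( index => '/path/to/index' );

# Discard the first 9 hits and return the next 10.
my $hits = $searcher->hits(
    query      => 'foo',
    offset     => 9,
    num_wanted => 10,
);

# total_hits counts every match, not just the window returned.
printf "total matches: %d\n", $hits->total_hits;

while ( my $hit = $hits->next ) {
    printf "doc %d  score %0.3f\n", $hit->get_doc_id, $hit->get_score;
}
```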


> 
> I can overcome the absence of such a command in Lucy by tweaking my program and moving
> some of my logic to an earlier stage.
> 
> I will continue my migration and will let you know if there are 'more bumps on the road'.
> 
> I can also make a more detailed performance comparison if you like.


I, for one, would be interested in hearing your thoughts, Thomas. As you might
expect, I have some experience with both. :)


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com
