lucy-user mailing list archives

From Nick Wellnhofer <>
Subject Re: [lucy-user] Dynamic document boost
Date Sat, 11 Feb 2012 22:27:10 GMT
On 11/02/2012 22:18, Marvin Humphrey wrote:
> On Sat, Feb 11, 2012 at 10:03:37PM +0100, Nick Wellnhofer wrote:
>> What's the best way to apply a boost factor dynamically to a (small)
>> subset of documents?
> I would suggest using a RequiredOptionalQuery.  Have the logical results
> depend on the required_query and boost using the optional_query.
>      my $parsed_query = $query_parser->parse($user_query_string);
>      my $user_id_boost_query = Lucy::Search::TermQuery->new(
>          field =>  'user_id',
>          term  =>  $user_id,
>      );
>      $user_id_boost_query->set_boost($arbitrary_boost);
>      my $req_opt_query = Lucy::Search::RequiredOptionalQuery->new(
>          required_query =>  $parsed_query,
>          optional_query =>  $user_id_boost_query,
>      );
> If the query to identify the subset of documents is very expensive, you might
> look into using LucyX::Search::Filter to cache the results (but note that
> Filter does not cache in a clustered environment).
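
For reference, a minimal sketch of the Filter approach Marvin mentions, following the LucyX::Search::Filter synopsis (the `$expensive_subset_query` and `$parsed_query` variables are illustrative):

    use LucyX::Search::Filter;
    use Lucy::Search::ANDQuery;

    # Wrap the expensive subset query in a Filter so its matching
    # doc set is cached across searches on the same index.
    my $filter = LucyX::Search::Filter->new(
        query => $expensive_subset_query,
    );

    # Filtered docs score 0.0, so combine with the real query via AND.
    my $and_query = Lucy::Search::ANDQuery->new(
        children => [ $filter, $parsed_query ],
    );

Note that because Filter contributes a score of 0.0, it is suited to restricting the result set cheaply rather than boosting scores.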

Thanks for pointing me to RequiredOptionalQuery. It looks very useful.

I can't model the query that identifies the subset directly in Lucy. The 
subset is computed by some other code, so I think I'll end up with an 
ORQuery of about 100 TermQueries matching a StringType field that 
contains an external document ID.
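
Roughly like this, adapting your RequiredOptionalQuery example (a sketch only; `ext_id` and `@external_ids` are placeholder names):

    use Lucy::Search::TermQuery;
    use Lucy::Search::ORQuery;
    use Lucy::Search::RequiredOptionalQuery;

    # @external_ids holds the ~100 IDs computed by the other code.
    my @id_queries = map {
        Lucy::Search::TermQuery->new( field => 'ext_id', term => $_ )
    } @external_ids;

    my $subset_query = Lucy::Search::ORQuery->new(
        children => \@id_queries,
    );
    $subset_query->set_boost($arbitrary_boost);

    my $req_opt_query = Lucy::Search::RequiredOptionalQuery->new(
        required_query => $parsed_query,
        optional_query => $subset_query,
    );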

>> Is there a better way than to simply retrieve all the results, apply the
>> boost factor manually to the scores and sort the results again?
> I hope you don't have to resort to post-search filtering.  That's slow to
> begin with and it doesn't scale very well because of the costs of retrieving
> so many documents.  You also have to resort to non-idiomatic sorting code
> (using a priority queue rather than the Perl sort() function) if you don't
> want memory usage to balloon.

It wouldn't be too bad in my use case because the number of results is 
limited. But I'm curious what the most scalable solution would look like.

