lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From José Ramón Pérez Agüera <>
Subject Re: problem in Lucene's ranking function
Date Wed, 05 May 2010 18:52:14 GMT
Hi Robert,

the problem is not the linear combination of fields, the problem is to
apply the boost factor per field after the term frequency saturation
function and then make the linear combination of fields. Every system
that implement BM25F, including terrier, take care of that, because if
you don't do it you have a bug in your ranking function and not just a
different ranking function.

It is very easy solve the problem in the current Lucene ranking
function, just move the boost factor per field inside the term
frequency square root, that's all. If you implement this little
change, Lucene ranking fucntion will work properly with structured
documents and all your other concerns about allowing users to
implement different ranking functions for different situations will be
not affected by this change.

I really appreciate your work to improve Lucene ranking function, and
your time to response this emails :-)



On Wed, May 5, 2010 at 2:12 PM, Robert Muir <> wrote:
> 2010/5/5 José Ramón Pérez Agüera <>
>> Hi Robert,
>> thank you very much for your quick response, I have a couple of questions,
>> did you read the papers that I mention in my e-mail?
> Yes.
>> do you think that Lucene ranking function could have this problem?
> I know it does.
>> My concern is not about how to implement different kind of ranking
>> functions for Lucene, I know that you are doing a very nice work to
>> implement a very flexible ranking framework for Lucene, my concern is
>> about a bug, which is independent of the ranking function that you are
>> using and which appears whether some kind of saturation function is
>> used in combination with a linear combination of fields for structured
>> documents.
> I think we might disagree here though. Must 'the combining of scores from
> different fields' must be hardcoded to one simple solution, or should it be
> something that you can control yourself?
> For example, it appears Terrier implements something different for this
> problem, not the paper you referenced but a different technique?:
> But
> I don't quite understand all the subleties involved... it seems in this
> other paper there is still a linear combination, but you introduce
> additional per-field parameters.
> The thing that makes me nervous about "hardcoding/changing" the way that
> scores are combined across fields is that Lucene presents some strange
> peculiarities, most notably the ability to use different scoring models for
> different fields. This in fact already exists today, if you "omitTF" for one
> field but not for another, you are using a different scoring model for the
> two fields.
>> Maybe I'm wrong, but if the linear combination of fields remains in
>> lucene ranking function core, Lucene is never going to work properly
>> to compute the score for structured documents.
> I wouldn't say never, maybe we will not get there in the first go, but
> hopefully at least you will be able to do the things i mentioned above, such
> as using different similarities for different fields, including ones that
> are not supported today.
>> I know how to solve the problem, and we have our own implementation of
>> BM25F for Lucene which performance is much better that standard
>> Lucene's ranking function, but I think that would be useful for other
>> Lucene users to know what is the problem to deal with structured
>> documents, and how to fix this problem for the next version,
>> independently what ranking function is finally implemented for Lucene.
> It would be great if you could help us on that issue (I know the patch is a
> bit out of date), to try to fix the scoring APIs, including perhaps thinking
> about how to improve search across multiple fields for structured documents.
> In my opinion, I would like to see the situation evolve away from "which
> ranking function is implemented for Lucene" instead to having a variety of
> built-in functions you can choose from.
> So, I would rather it be more like Analyzers, where we have a variety of
> high-quality implementations available, and you can make your own if you
> must, but there is no real default.
> --
> Robert Muir

Jose R. Pérez-Agüera

Clinical Assistant Professor
Metadata Research Center
School of Information and Library Science
University of North Carolina at Chapel Hill
Web page:
MRC website:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message