Interesting.
Could surrogates also bring the searcher's subjectivity (opinion and context) into the scoring through the learning process?
shridhar

Sean Timm wrote:
It may not be easy or even possible without major changes, but having global collection statistics would allow scores to be compared across searchers.  To do this, the master indexes would need to be able to communicate with each other.
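
As a rough illustration of the global-statistics idea, the sketch below sums per-searcher document counts and document frequencies before computing an idf, so every searcher would weight a term the same way. The formula 1 + ln(numDocs / (docFreq + 1)) is the one used by Lucene's DefaultSimilarity; the function name and the (num_docs, doc_freq) pairs are hypothetical, and how the searchers would actually exchange these numbers is not addressed here.

```python
import math

def global_idf(shard_stats):
    """Compute one idf for a term from per-searcher statistics.

    shard_stats: iterable of (num_docs, doc_freq) pairs, one per searcher
    (a hypothetical input format, assumed for this sketch).
    """
    shard_stats = list(shard_stats)
    total_docs = sum(num_docs for num_docs, _ in shard_stats)
    total_freq = sum(doc_freq for _, doc_freq in shard_stats)
    if total_docs == 0:
        raise ValueError("no documents in any searcher")
    # Same shape as DefaultSimilarity.idf, but over the combined collection.
    return 1.0 + math.log(total_docs / (total_freq + 1.0))
```

For example, two searchers holding 1000 and 3000 documents, with the term appearing in 10 and 29 of them, would share a single idf computed from 4000 documents and a document frequency of 39.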

Another approach to merging across searchers is described here:
Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, Greg Pass, Ophir Frieder, "Surrogate Scoring for Improved Metasearch Precision", Proceedings of the 2005 ACM Conference on Research and Development in Information Retrieval (SIGIR-2005), Salvador, Brazil, August 2005.

-Sean

deinspanjer@gmail.com wrote:
On 4/11/07, Chris Hostetter <hossman_lucene@fucit.org> wrote:


A custom Similarity class with simplified tf, idf, and queryNorm functions
might also help you get scores from the Explain method that are more
easily manageable since you'll have predictable query structures hard
coded into your application.

i.e.: run the large query once, get the results back, and for each result
look at the explanation and pull out the individual pieces of the
explanation and compare them with those of the other matches to create
your own "normalization".
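
One possible reading of that suggestion, as a minimal sketch: assume the per-clause scores have already been pulled out of each hit's explanation into a dictionary (that extraction step is not shown, and the dictionary layout and function name are hypothetical). Each piece is then divided by the largest value seen for that piece across all hits, which gives a 0..1 range per piece within one result set.

```python
def normalize_components(hits):
    """Scale each explanation piece by its maximum across all hits.

    hits: {doc_id: {piece_name: raw_score}} (assumed layout, filled in
    by whatever code walks the explanations).
    """
    maxima = {}
    for parts in hits.values():
        for name, value in parts.items():
            maxima[name] = max(maxima.get(name, 0.0), value)
    return {
        doc_id: {
            name: (value / maxima[name] if maxima[name] else 0.0)
            for name, value in parts.items()
        }
        for doc_id, parts in hits.items()
    }
```

Note that this only makes hits comparable with each other inside one result set; it does not by itself make scores comparable between different queries.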


Chuck Williams mentioned a proposal he had for normalization of scores that
would give a constant score range that would allow comparison of scores.
Chuck, did you ever write any code to that end or was it just algorithmic
discussion?

Here is the point I'm at now:

I have my matching engine working.  The fields to be indexed and the queries
are defined by the user.  Hoss, I'm not sure how that affects your idea of
having a custom Similarity class since you mentioned that having predictable
query structures was important...
The user kicks off an indexing run, then defines the queries they want to try
matching with.  Here is an example of the query fragments I'm working with
right now:
year_str:"${Year}"^2 year_str:[${Year -1} TO ${Year +1}]
title_title_mv:"${Title}"^10 title_title_mv:${Title}^2
+(title_title_mv:"${Title}"~^5 title_title_mv:${Title}~)
director_name_mv:"${Director}"~2^10 director_name_mv:${Director}^5
director_name_mv:${Director}~.7

For each item in the source feed, the variables are interpolated (the query
term is transformed into a grouped term if there are multiple values for a
variable). That query is then made to find the overall best match.
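
The interpolation step might look roughly like the sketch below. It is not the actual matching engine: the function name, the regular expression, and the choice to group multiple values as (a b) are assumptions made for illustration. Values are not escaped for the query parser, and trailing modifiers such as ~2 are left where they are, which would need per-term handling once a placeholder expands to a group.

```python
import re

# Matches ${Name}, ${Name +1} or ${Name -1}, optionally wrapped in double quotes.
PLACEHOLDER = re.compile(
    r'(?P<quote>")?\$\{(?P<name>\w+)\s*(?:(?P<sign>[+-])\s*(?P<amount>\d+))?\}(?(quote)")'
)

def interpolate(fragment, item):
    """Fill a query fragment template with the values of one feed item."""
    def substitute(match):
        values = item[match.group("name")]
        if not isinstance(values, (list, tuple)):
            values = [values]
        if match.group("sign"):
            # Arithmetic offsets such as ${Year -1} assume integer values.
            delta = int(match.group("amount"))
            if match.group("sign") == "-":
                delta = -delta
            values = [int(v) + delta for v in values]
        if match.group("quote"):
            terms = ['"%s"' % v for v in values]
        else:
            terms = [str(v) for v in values]
        # Multiple values become a grouped term: field:(a b)
        return terms[0] if len(terms) == 1 else "(" + " ".join(terms) + ")"
    return PLACEHOLDER.sub(substitute, fragment)
```

With an item such as {"Year": 1977}, the first fragment above would expand to a phrase term on 1977 plus a range from 1976 to 1978.
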
I then determine the relevance for each query fragment.  I haven't written
any plugins for Lucene yet, so my current method of determining the
relevance is by running each query fragment by itself then iterating through
the results looking to see if the overall best match is in this result set.
If it is, I record the rank and multiply that rank (e.g. 5 out of 10) by a
configured fragment weight.
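
A minimal sketch of that rank-times-weight calculation, assuming each fragment's results are available as an ordered list of document ids. The function names are hypothetical, and so is the reading of "rank" as a fraction where the top hit counts as 1.0 and lower positions count for less.

```python
def fragment_relevance(best_id, ranked_ids, weight):
    """Weighted rank of the overall best match within one fragment's results."""
    if best_id not in ranked_ids:
        return 0.0  # the best match was not found by this fragment
    position = ranked_ids.index(best_id)  # 0 for the top hit
    return weight * (len(ranked_ids) - position) / len(ranked_ids)

def overall_relevance(best_id, fragment_results):
    """Sum the weighted ranks over all fragments.

    fragment_results: list of (ranked_ids, weight) pairs, one per fragment.
    """
    return sum(
        fragment_relevance(best_id, ranked_ids, weight)
        for ranked_ids, weight in fragment_results
    )
```

Dividing the sum by the total of the weights would bound the result to the range 0..1, though it would still reflect rank only, not the absolute quality of the match.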

Since the scores aren't normalized, I have no good way of distinguishing a poor
overall match from a really high-quality one. Even a poor overall match could be
the first item returned for each of the query fragments.

Any help here would be much appreciated. Ideally, I'm hoping that maybe
Chuck has a patch or plugin that I could use to normalize my scores such
that I could let the user do a matching run, look at the results and
determine what score threshold to set for subsequent runs.

Thanks,
Daniel