lucene-solr-user mailing list archives

From: Shridhar Venkatraman <Shrid...@NeemTree.com>
Subject: Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?
Date: Mon, 07 May 2007 04:37:25 GMT
Interesting. Could surrogates also bring the searcher's subjectivity (opinion and
context) into it through the learning process?

Shridhar

Sean Timm wrote:
> It may not be easy or even possible without major changes, but having global
> collection statistics would allow scores to be compared across searchers.  To
> do this, the master indexes would need to be able to communicate with each
> other.
>
> Another approach to merging across searchers is described here:
>
> Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, Greg Pass, Ophir Frieder,
> "Surrogate Scoring for Improved Metasearch Precision", Proceedings of the 2005
> ACM Conference on Research and Development in Information Retrieval
> (SIGIR 2005), Salvador, Brazil, August 2005.
> http://ir.iit.edu/%7Eabdur/publications/p19-beitzel.pdf
>
> -Sean
>
> deinspanjer@gmail.com wrote:
>
>> On 4/11/07, Chris Hostetter <hossman_lucene@fucit.org> wrote:
>>>
>>> A custom Similarity class with simplified tf, idf, and queryNorm functions
>>> might also help you get scores from the Explain method that are more easily
>>> manageable, since you'll have predictable query structures hard coded into
>>> your application.
>>>
>>> i.e.: run the large query once, get the results back, and for each result
>>> look at the explanation, pull out the individual pieces of the explanation,
>>> and compare them with those of the other matches to create your own
>>> "normalization".
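
A rough sketch of the kind of simplified Similarity being suggested, assuming the
Lucene 2.x API current at the time of this thread (the class name and the constant
values are illustrative, not anyone's actual code):

    import org.apache.lucene.search.DefaultSimilarity;

    // Flatten the factors that depend on term statistics so that Explanation
    // values from structurally identical queries are easier to compare.
    public class FlatSimilarity extends DefaultSimilarity {

        // Any number of occurrences of a term counts the same as one.
        public float tf(float freq) {
            return freq > 0f ? 1.0f : 0.0f;
        }

        // Ignore document frequency so scores do not drift as the index changes.
        public float idf(int docFreq, int numDocs) {
            return 1.0f;
        }

        // Drop the per-query normalization factor.
        public float queryNorm(float sumOfSquaredWeights) {
            return 1.0f;
        }
    }

It would be set on both the IndexWriter and the Searcher (via setSimilarity), and
the per-clause numbers can then be pulled out of Searcher.explain(query, docId) and
compared across hits, as described above.
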
>>
>> Chuck Williams mentioned a proposal he had for normalization of scores that
>> would give a constant score range, allowing scores to be compared.  Chuck,
>> did you ever write any code to that end, or was it just algorithmic
>> discussion?
>>
>> Here is the point I'm at now:
>>
>> I have my matching engine working.  The fields to be indexed and the queries
>> are defined by the user.  Hoss, I'm not sure how that affects your idea of
>> having a custom Similarity class, since you mentioned that having predictable
>> query structures was important...
>> The user kicks off an indexing run and then defines the queries they want to
>> try matching with.  Here is an example of the query fragments I'm working
>> with right now:
>>
>> year_str:"${Year}"^2 year_str:[${Year -1} TO ${Year +1}]
>> title_title_mv:"${Title}"^10 title_title_mv:${Title}^2
>> +(title_title_mv:"${Title}"~^5 title_title_mv:${Title}~)
>> director_name_mv:"${Director}"~2^10 director_name_mv:${Director}^5
>> director_name_mv:${Director}~.7
>>
>> For each item in the source feed, the variables are interpolated (the query
>> term is transformed into a grouped term if there are multiple values for a
>> variable).  That query is then run to find the overall best match.
>> I then determine the relevance for each query fragment.  I haven't written
>> any plugins for Lucene yet, so my current method of determining the relevance
>> is to run each query fragment by itself and then iterate through the results,
>> looking to see whether the overall best match is in this result set.  If it
>> is, I record the rank and multiply that rank (e.g. 5 out of 10) by a
>> configured fragment weight.
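
A rough sketch of that per-item flow (the QueryRunner interface and the way the
rank is folded into the total are placeholders for whatever the real matching
engine does, and expressions like ${Year -1} are not handled here):

    import java.util.List;
    import java.util.Map;

    public class FragmentScorer {

        // Stand-in for whatever actually executes a query against the index
        // and returns document ids in rank order; not a real Solr/Lucene API.
        interface QueryRunner {
            List<String> run(String query, int rows);
        }

        // Substitute ${Year}, ${Title}, ... from the source-feed item into a
        // query fragment template.
        static String interpolate(String fragment, Map<String, String> item) {
            String result = fragment;
            for (Map.Entry<String, String> e : item.entrySet()) {
                result = result.replace("${" + e.getKey() + "}", e.getValue());
            }
            return result;
        }

        static double scoreItem(Map<String, String> item,
                                List<String> fragmentTemplates,
                                double[] fragmentWeights,
                                QueryRunner runner) {
            // 1. Build the full query from all fragments and take its top hit
            //    as the overall best match for this item.
            StringBuilder full = new StringBuilder();
            for (String template : fragmentTemplates) {
                full.append(interpolate(template, item)).append(' ');
            }
            List<String> overall = runner.run(full.toString().trim(), 10);
            if (overall.isEmpty()) {
                return 0.0;
            }
            String bestMatchId = overall.get(0);

            // 2. Run each fragment on its own and check where the overall best
            //    match ranks in that fragment's result set.
            double total = 0.0;
            for (int i = 0; i < fragmentTemplates.size(); i++) {
                String query = interpolate(fragmentTemplates.get(i), item);
                List<String> hits = runner.run(query, 10);
                int rank = hits.indexOf(bestMatchId);   // -1 if not in this result set
                if (rank >= 0) {
                    // As described above: record the rank (e.g. 5 out of 10) and
                    // multiply it by the configured fragment weight.
                    total += (rank + 1) * fragmentWeights[i];
                }
            }
            return total;
        }
    }
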
>>
>> Since the scores aren't normalized, I have no good way of telling a poor
>> overall match from a really high-quality one.  The overall item could be the
>> first item returned in each of the query fragments.
>>
>> Any help here would be very much appreciated.  Ideally, I'm hoping that maybe
>> Chuck has a patch or plugin that I could use to normalize my scores such that
>> I could let the user do a matching run, look at the results, and determine
>> what score threshold to set for subsequent runs.
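
Short of that constant-range proposal (for which no code appears in this thread),
a minimal sketch of one relative workaround, using the Lucene API of roughly that
era: divide each raw score by the best score in the same result set so everything
lands in 0..1. Note this only makes scores comparable within a single search,
which is exactly the limitation being discussed.

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    public class RelativeScores {

        // Rescale the top-n raw scores by the best score of this run.
        public static float[] normalize(IndexSearcher searcher, Query query, int n)
                throws java.io.IOException {
            TopDocs top = searcher.search(query, null, n);
            float[] normalized = new float[top.scoreDocs.length];
            if (top.scoreDocs.length == 0) {
                return normalized;
            }
            float best = top.scoreDocs[0].score;   // highest raw score in this run
            for (int i = 0; i < top.scoreDocs.length; i++) {
                float raw = top.scoreDocs[i].score;
                normalized[i] = best > 0f ? raw / best : 0f;
            }
            return normalized;
        }
    }
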
>>
>> Thanks,
>> Daniel
