lucene-solr-user mailing list archives

From "Sean Timm" <tim...@aol.com>
Subject Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?
Date Thu, 10 May 2007 01:59:55 GMT
<font face="Arial,sans-serif"><font size="2">Yes, for good (hopefully)
or bad.<br>
<br>
-Sean<br>
<br>
<span type="cite">Shridhar Venkatraman wrote on 5/7/2007, 12:37 AM:</span>
</font></font>
<p><font face="Arial,sans-serif" size="2"></font></p>
<blockquote type="cite"  style="border-left: thin solid blue; padding-left: 10px; margin-left:
0pt;"><font  face="Arial,sans-serif" size="2">Interesting..<br>
Surrogates can also bring the searcher's subjectivity (opinion and
context) into it by the learning process ?<br>
shridhar<br>
  <br>
Sean Timm wrote:
  </font>
  <blockquote cite="mid:463D2A76.2090902@aol.com" type="cite"><font  face="Arial,sans-serif"
size="2"> It may not be easy or even possible
without major changes, but having
global collection statistics would allow scores to be compared across
searchers.&nbsp; To do this, the master indexes would need to be able to
communicate with each other.<br>
    <br>
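
A rough sketch of that idea, assuming the Lucene 2.x Similarity API
(the GlobalStats holder, and however the masters would exchange their
counts, are invented here):

    import java.io.IOException;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.Searcher;

    // Hypothetical: term statistics merged from every master index.
    interface GlobalStats {
      int docFreq(Term term);
      int numDocs();
    }

    public class GlobalStatsSimilarity extends DefaultSimilarity {
      private final GlobalStats stats;

      public GlobalStatsSimilarity(GlobalStats stats) {
        this.stats = stats;
      }

      // Lucene 2.x computes a term's idf here from searcher.docFreq(term)
      // and searcher.maxDoc(); substituting the merged totals makes idf,
      // and hence scores, agree on every searcher.
      public float idf(Term term, Searcher searcher) throws IOException {
        return idf(stats.docFreq(term), stats.numDocs());
      }
    }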

Another approach to merging across searchers is described here:

Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, Greg Pass, Ophir
Frieder, "Surrogate Scoring for Improved Metasearch Precision",
Proceedings of the 2005 ACM Conference on Research and Development in
Information Retrieval (SIGIR-2005), Salvador, Brazil, August 2005.
http://ir.iit.edu/%7Eabdur/publications/p19-beitzel.pdf

-Sean

deinspanjer@gmail.com wrote:

On 4/11/07, Chris Hostetter <hossman_lucene@fucit.org> wrote:
      <blockquote type="cite"><font face="Arial,sans-serif" size="2"><br>
        <br>
A custom Similaity class with simplified tf, idf, and queryNorm
functions <br>
might also help you get scores from the Explain method that are more <br>
easily manageable since you'll have predictible query structures hard <br>
coded into your application. <br>
        <br>
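
A minimal sketch of such a simplified Similarity, assuming the Lucene
2.x DefaultSimilarity API (the particular flattening choices below are
only illustrative):

    import org.apache.lucene.search.DefaultSimilarity;

    // Flatten tf and idf and drop queryNorm, so the numbers left in an
    // Explanation come almost entirely from boosts and coord.
    public class FlatSimilarity extends DefaultSimilarity {
      public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f; // any occurrence count scores once
      }

      public float idf(int docFreq, int numDocs) {
        return 1.0f;                   // ignore term rarity
      }

      public float queryNorm(float sumOfSquaredWeights) {
        return 1.0f;                   // no per-query rescaling
      }
    }

It would be installed with searcher.setSimilarity(new FlatSimilarity()),
and on the IndexWriter as well if index-time norms should match.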

ie: run the large query once, get the results back, and for each result
look at the explanation and pull out the individual pieces of the
explanation and compare them with those of the other matches to create
your own "normalization".
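
For instance (a sketch against the Lucene 2.x Explanation API, not code
from the thread), the per-clause pieces can be collected by walking the
Explanation tree:

    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class ExplanationWalker {
      // Recursively print each node's value and description so the
      // per-clause contributions to a hit's score can be compared
      // against those of other hits.
      static void dump(Explanation e, int depth) {
        for (int i = 0; i < depth; i++) System.out.print("  ");
        System.out.println(e.getValue() + " : " + e.getDescription());
        Explanation[] details = e.getDetails();
        if (details == null) return;
        for (int i = 0; i < details.length; i++) {
          dump(details[i], depth + 1);
        }
      }

      static void dumpHit(IndexSearcher searcher, Query query, int docId)
          throws Exception {
        dump(searcher.explain(query, docId), 0);
      }
    }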

Chuck Williams mentioned a proposal he had for normalization of scores
that would give a constant score range and so allow scores to be
compared. Chuck, did you ever write any code to that end, or was it
just algorithmic discussion?

Here is the point I'm at now:

I have my matching engine working. The fields to be indexed and the
queries are defined by the user. Hoss, I'm not sure how that affects
your idea of having a custom Similarity class, since you mentioned that
having predictable query structures was important...

The user kicks off an indexing run, then defines the queries they want
to try matching with. Here is an example of the query fragments I'm
working with right now:

year_str:"${Year}"^2 year_str:[${Year -1} TO ${Year +1}]
title_title_mv:"${Title}"^10 title_title_mv:${Title}^2
+(title_title_mv:"${Title}"~^5 title_title_mv:${Title}~)
director_name_mv:"${Director}"~2^10 director_name_mv:${Director}^5
director_name_mv:${Director}~.7

For each item in the source feed, the variables are interpolated (the
query term is transformed into a grouped term if there are multiple
values for a variable). That query is then run to find the overall
best match, as sketched below.
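
The interpolation step might look roughly like this (hypothetical code,
not Daniel's; offset expressions such as ${Year -1} are looked up as
plain keys here rather than evaluated):

    import java.util.List;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class FragmentInterpolator {
      private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

      static String interpolate(String fragment,
                                Map<String, List<String>> item) {
        Matcher m = VAR.matcher(fragment);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
          List<String> values = item.get(m.group(1));
          // A multi-valued variable becomes a grouped term: (v1 v2 ...)
          String replacement = values.size() == 1
              ? values.get(0)
              : "(" + String.join(" ", values) + ")";
          m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
      }
    }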

I then determine the relevance for each query fragment. I haven't
written any plugins for Lucene yet, so my current method of determining
the relevance is to run each query fragment by itself and then iterate
through the results, looking to see if the overall best match is in
this result set. If it is, I record the rank and multiply that rank
(e.g. 5 out of 10) by a configured fragment weight.
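
That loop might look something like the sketch below (against the
Lucene 2.x Hits API, and assuming the overall best match's document id
is already in hand):

    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class FragmentRanker {
      // Run one fragment by itself and score the overall best match by
      // where it ranks in that fragment's results, times the weight.
      static double fragmentScore(IndexSearcher searcher, Query fragment,
                                  int bestMatchDoc, double weight)
          throws Exception {
        Hits hits = searcher.search(fragment);
        for (int i = 0; i < hits.length(); i++) {
          if (hits.id(i) == bestMatchDoc) {
            return (i + 1) * weight; // rank is 1-based, e.g. 5 out of 10
          }
        }
        return 0.0; // best match never appeared for this fragment
      }
    }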

Since the scores aren't normalized, I have no good way of telling a
poor overall match from a really high-quality one. The overall item
could be the first item returned by each of the query fragments.

Any help here would be much appreciated. Ideally, I'm hoping that maybe
Chuck has a patch or plugin that I could use to normalize my scores,
such that I could let the user do a matching run, look at the results,
and determine what score threshold to set for subsequent runs.

Thanks,
Daniel

