nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-92) DistributedSearch incorrectly scores results
Date Thu, 15 Sep 2005 19:27:54 GMT
DistributedSearch incorrectly scores results
--------------------------------------------

         Key: NUTCH-92
         URL: http://issues.apache.org/jira/browse/NUTCH-92
     Project: Nutch
        Type: Bug
  Components: searcher  
    Versions: 0.8-dev, 0.7    
    Reporter: Andrzej Bialecki 
 Assigned to: Andrzej Bialecki  


When search servers run in a distributed setup, using DistributedSearch$Server and Client,
total scores are calculated incorrectly. The symptom is that scores depend on how segments
are deployed to Servers: if terms are unevenly distributed across segment indexes (due to
segment size or content differences), scores vary with how many and which segments are
deployed on a particular Server. This may rank less relevant results above more relevant ones.

The underlying reason is that each IndexSearcher (which uses the local index on each
Server) calculates scores based on the local IDFs of query terms, not the global IDFs
across all indexes. Scores arriving at the Client from different Servers therefore cannot
be meaningfully compared unless all indexes have a similar distribution of Terms and
similar numbers of documents. Currently, however, the Client mixes all scores together,
sorts them by absolute value and picks the top hits. These absolute values change if segments
are unevenly deployed to Servers.
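
To illustrate the effect, here is a minimal sketch. It assumes the classic Lucene
DefaultSimilarity formula, idf = log(numDocs / (docFreq + 1)) + 1; the class name and the
document counts are invented for illustration and do not come from a real deployment.

```java
// Hypothetical illustration: all document counts below are made up.
final class LocalVsGlobalIdf {

    // Assumed formula: classic Lucene DefaultSimilarity.idf(docFreq, numDocs).
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // Server A: small index in which the query term is rare.
        float idfA = idf(9, 1_000);
        // Server B: large index in which the same term is common.
        float idfB = idf(49_999, 100_000);
        // Global view: document frequencies and document counts summed over both Servers.
        float idfGlobal = idf(9 + 49_999, 1_000 + 100_000);

        System.out.printf("local A=%.3f, local B=%.3f, global=%.3f%n", idfA, idfB, idfGlobal);
    }
}
```

With these numbers the same term gets a much higher weight on Server A than on Server B, so
a hit from A can outrank an equally good hit from B purely because of how the segments were
deployed.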

Currently the workaround is to deploy the same number of documents in segments on each Server,
and to ensure that segments contain well-randomized content so that term frequencies for common
terms are very similar.

The solution proposed here (the result of a discussion between ab and cutting; patches are
coming) is to calculate global IDFs before running the query, and to pre-boost query Terms
with these global IDFs. This requires one more RPC call per query (this can be optimized
later, e.g. through caching). Scores will then be normalized according to the global
IDFs, and the Client will be able to compare them meaningfully. Scores will also become
independent of the segment content and of the local number of documents per Server. This
will involve at least the following changes:

* change NutchSimilarity.idf(Term, Searcher) to always return 1.0f. This enables us to manipulate
scores independently of local IDFs.

* add a new method to the Searcher interface, int[] getDocFreqs(Term[]), which will return
document frequencies for query terms.

* modify getSegmentNames() so that it also returns the total number of documents in each
segment, or implement this as a separate method (this will be called once during segment init)

* in DistributedSearch$Client.search() first make a call to servers to return local document
frequencies for the current query, and calculate global IDFs for each relevant Term in that query.

* multiply the TermQuery boosts by idf(totalDocFreq, totalIndexedDocs), and PhraseQuery boosts
by the sum of the idf(totalDocFreqs, totalIndexedDocs) for all of its terms
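
The client-side steps above can be sketched roughly as follows. This is not the coming
patch: it uses plain arrays instead of Lucene Term and Query objects, it assumes the classic
Lucene formula idf = log(numDocs / (docFreq + 1)) + 1, and the names GlobalIdfSketch,
sumDocFreqs, termBoost and phraseBoost are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposed client-side boost calculation.
final class GlobalIdfSketch {

    // Assumed formula: classic Lucene DefaultSimilarity.idf(docFreq, numDocs).
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Element-wise sum of the arrays each Server would return from the
    // proposed getDocFreqs(Term[]) call, one array per Server.
    static int[] sumDocFreqs(List<int[]> perServer, int termCount) {
        int[] total = new int[termCount];
        for (int[] local : perServer) {
            for (int i = 0; i < termCount; i++) {
                total[i] += local[i];
            }
        }
        return total;
    }

    // TermQuery: multiply the existing boost by the global idf of its term.
    static float termBoost(float boost, int totalDocFreq, int totalIndexedDocs) {
        return boost * idf(totalDocFreq, totalIndexedDocs);
    }

    // PhraseQuery: multiply the existing boost by the sum of the global idfs of its terms.
    static float phraseBoost(float boost, int[] totalDocFreqs, int totalIndexedDocs) {
        float sum = 0f;
        for (int docFreq : totalDocFreqs) {
            sum += idf(docFreq, totalIndexedDocs);
        }
        return boost * sum;
    }

    public static void main(String[] args) {
        // Invented per-Server document frequencies for a two-term query.
        List<int[]> perServer = Arrays.asList(new int[] {9, 3}, new int[] {49_999, 7});
        int totalIndexedDocs = 1_000 + 100_000; // invented per-Server document counts
        int[] totals = sumDocFreqs(perServer, 2);

        System.out.println("global docFreqs = " + Arrays.toString(totals));
        System.out.println("term boost      = " + termBoost(1.0f, totals[0], totalIndexedDocs));
        System.out.println("phrase boost    = " + phraseBoost(1.0f, totals, totalIndexedDocs));
    }
}
```

Because NutchSimilarity.idf would return 1.0f on every Server, these boosts would be the only
place where IDF enters the score, so every Server would weight a given term identically.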

This solution should be applicable to all branches with only minor changes, but initially
the patches will be relative to trunk/.

Comments, suggestions and review are welcome!

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

