nutch-dev mailing list archives

From "Doug Cutting (JIRA)" <>
Subject [jira] Commented: (NUTCH-92) DistributedSearch incorrectly scores results
Date Thu, 15 Sep 2005 20:21:54 GMT
Doug Cutting commented on NUTCH-92:

A minor detail:

In Searcher, instead of

  int[] getDocFreqs(Term[]);

the new method will probably have to be something like

  public int[] getDocFreqs(TermSet);

TermSet can then implement Writable, since Nutch can't serialize Lucene Terms.
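
To illustrate the idea, here is a minimal sketch of what such a TermSet might look like. Everything in it is hypothetical: SketchWritable is a stand-in modeled on a write/readFields pair rather than the actual Nutch interface, and TermSetSketch, its accessors, and roundTrip are made-up names. The point is only that terms can travel over RPC as plain (field, text) string pairs instead of Lucene Term objects.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Writable-style contract (assumption, not the Nutch interface).
interface SketchWritable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// Hypothetical TermSet: stores (field, text) pairs so no Lucene Term is serialized.
class TermSetSketch implements SketchWritable {
  private final List<String[]> terms = new ArrayList<>();

  void add(String field, String text) {
    terms.add(new String[] { field, text });
  }

  int size() { return terms.size(); }
  String field(int i) { return terms.get(i)[0]; }
  String text(int i) { return terms.get(i)[1]; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(terms.size());          // number of terms first
    for (String[] t : terms) {
      out.writeUTF(t[0]);                // field name
      out.writeUTF(t[1]);                // term text
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    terms.clear();
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      String field = in.readUTF();
      String text = in.readUTF();
      terms.add(new String[] { field, text });
    }
  }

  // Serializes to a byte array and reads it back, as an RPC layer would do.
  static TermSetSketch roundTrip(TermSetSketch original) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      original.write(new DataOutputStream(bytes));
      TermSetSketch copy = new TermSetSketch();
      copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A server receiving such a set could rebuild Lucene Terms locally from each (field, text) pair before looking up document frequencies.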

> DistributedSearch incorrectly scores results
> --------------------------------------------
>          Key: NUTCH-92
>          URL:
>      Project: Nutch
>         Type: Bug
>   Components: searcher
>     Versions: 0.8-dev, 0.7
>     Reporter: Andrzej Bialecki 
>     Assignee: Andrzej Bialecki 

> When running search servers in a distributed setup, using DistributedSearch$Server and
Client, total scores are incorrectly calculated. The symptoms are that scores differ depending
on how segments are deployed to Servers, i.e. if there is uneven distribution of terms in
segment indexes (due to segment size or content differences) then scores will differ depending
on how many and which segments are deployed on a particular Server. This may lead to non-relevant
results being prioritized over more relevant ones.
> The underlying reason for this is that each IndexSearcher (which uses local index on
each Server) calculates scores based on the local IDFs of query terms, and not the global
IDFs from all indexes together. This means that scores arriving from different Servers to
the Client cannot be meaningfully compared, unless all indexes have similar distribution of
Terms and similar numbers of documents in them. However, currently the Client mixes all scores
together, sorts them by absolute values and picks the top hits. These absolute values will change
if segments are unevenly deployed to Servers.
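
A small numeric sketch may make the effect concrete. It assumes the classic Lucene-style formula idf = log(numDocs / (docFreq + 1)) + 1, which may differ from what NutchSimilarity actually computes, and the index sizes and document frequencies are invented for illustration. The class name LocalIdfSketch is hypothetical.

```java
// Illustrative only: shows how the same term gets different IDFs on different servers.
class LocalIdfSketch {

  // Classic Lucene-style idf (an assumption about the formula in use).
  static float idf(int docFreq, int numDocs) {
    return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
  }

  public static void main(String[] args) {
    // Hypothetical Server A: small index where the term is rare.
    float idfA = idf(9, 1_000);
    // Hypothetical Server B: large index where the same term is common.
    float idfB = idf(49_999, 100_000);
    // The IDF both servers would use if they shared global statistics.
    float idfGlobal = idf(9 + 49_999, 1_000 + 100_000);

    System.out.println("local idf on A:  " + idfA);
    System.out.println("local idf on B:  " + idfB);
    System.out.println("global idf:      " + idfGlobal);
  }
}
```

With these made-up numbers the term weighs far more on Server A than on Server B, so a hit from A can outrank an equally good hit from B purely because of where its segment happens to live.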
> Currently the workaround is to deploy the same number of documents in segments per Server,
and to ensure that segments contain well-randomized content so that term frequencies for common
terms are very similar.
> The solution proposed here (as a result of discussion between ab and cutting, patches
are coming) is to calculate global IDFs prior to running the query, and pre-boost query Terms
with these global IDFs. This will require one more RPC call per query (this can be optimized
later, e.g. through caching). Then the scores will become normalized according to the global
IDFs, and Client will be able to meaningfully compare them. Scores will also become independent
of the segment content or local number of documents per Server. This will involve at least
the following changes:
> * change NutchSimilarity.idf(Term, Searcher) to always return 1.0f. This enables us to
manipulate scores independently of local IDFs.
> * add a new method to Searcher interface, int[] getDocFreqs(Term[]), which will return
document frequencies for query terms.
> * modify getSegmentNames() so that it also returns the total number of documents in each
segment, or implement this as a separate method (this will be called once during segment init).
> * in DistributedSearch$ first make a call to servers to return local IDFs
for the current query, and calculate global IDFs for each relevant Term in that query.
> * multiply the TermQuery boosts by idf(totalDocFreq, totalIndexedDocs), and PhraseQuery
boosts by the sum of the idf(totalDocFreqs, totalIndexedDocs) for all of its terms.
> This solution should be applicable with only minor changes to all branches, but initially
the patches will be relative to trunk/ .
> Comments, suggestions and review are welcome!
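
The client-side arithmetic of the proposal quoted above can be sketched roughly as follows. This is not the patch itself: the class GlobalIdfSketch and its method names are hypothetical, the per-server arrays stand in for whatever the extra RPC call would return, and the idf formula is again the classic Lucene-style one, assumed for illustration.

```java
// Rough sketch of combining per-server statistics into global IDF boosts.
class GlobalIdfSketch {

  // Classic Lucene-style idf (an assumption about the formula in use).
  static float idf(int docFreq, int numDocs) {
    return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
  }

  // Adds up document frequencies term by term; row = server, column = query term.
  static int[] sumDocFreqs(int[][] perServerDocFreqs) {
    if (perServerDocFreqs.length == 0) {
      return new int[0];
    }
    int[] total = new int[perServerDocFreqs[0].length];
    for (int[] server : perServerDocFreqs) {
      for (int i = 0; i < total.length; i++) {
        total[i] += server[i];
      }
    }
    return total;
  }

  // Total number of indexed documents across all servers.
  static int sum(int[] values) {
    int total = 0;
    for (int v : values) {
      total += v;
    }
    return total;
  }

  // One boost per query term: idf(totalDocFreq, totalIndexedDocs).
  static float[] termBoosts(int[][] perServerDocFreqs, int[] perServerNumDocs) {
    int[] totalDocFreqs = sumDocFreqs(perServerDocFreqs);
    int totalIndexedDocs = sum(perServerNumDocs);
    float[] boosts = new float[totalDocFreqs.length];
    for (int i = 0; i < boosts.length; i++) {
      boosts[i] = idf(totalDocFreqs[i], totalIndexedDocs);
    }
    return boosts;
  }

  // Phrase boost: the sum of the global IDFs of the phrase's terms.
  static float phraseBoost(float[] boostsOfPhraseTerms) {
    float total = 0f;
    for (float b : boostsOfPhraseTerms) {
      total += b;
    }
    return total;
  }
}
```

Because every server then scores with a local idf of 1.0f and the same pre-computed boosts, the resulting scores no longer depend on which segments a given server holds.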

This message is automatically generated by JIRA.