nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-92) DistributedSearch incorrectly scores results
Date Tue, 03 Feb 2009 15:31:59 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-92?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669990#action_12669990 ]

Andrzej Bialecki  commented on NUTCH-92:
----------------------------------------

Moving to 1.1 - needs further discussion, see also this thread: http://markmail.org/message/xyqdz3go6jwu4ozm

> DistributedSearch incorrectly scores results
> --------------------------------------------
>
>                 Key: NUTCH-92
>                 URL: https://issues.apache.org/jira/browse/NUTCH-92
>             Project: Nutch
>          Issue Type: Bug
>          Components: searcher
>    Affects Versions: 1.1
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>         Attachments: distributed-idf-v2.patch, distributed-idf.patch
>
>
> When running search servers in a distributed setup, using DistributedSearch$Server and Client, total scores are incorrectly calculated. The symptom is that scores differ depending on how segments are deployed to Servers: if there is an uneven distribution of terms across segment indexes (due to segment size or content differences), then scores will differ depending on how many and which segments are deployed on a particular Server. This may lead to prioritizing non-relevant results over more relevant ones.
> The underlying reason is that each IndexSearcher (which uses the local index on each Server) calculates scores based on the local IDFs of query terms, not the global IDFs across all indexes together. This means that scores arriving at the Client from different Servers cannot be meaningfully compared, unless all indexes have a similar distribution of Terms and a similar number of documents. However, the Client currently mixes all scores together, sorts them by absolute value and picks the top hits. These absolute values change if segments are unevenly deployed to Servers.
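
As a rough illustration of that skew (not part of the attached patches), assuming the classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)) and made-up segment sizes, the same term can receive very different weights on two Servers:

    // Hypothetical numbers; idf() follows the classic Lucene DefaultSimilarity formula.
    public class LocalVsGlobalIdf {
      static float idf(int docFreq, int numDocs) {
        return (float) (Math.log((double) numDocs / (docFreq + 1)) + 1.0);
      }

      public static void main(String[] args) {
        float idfA = idf(900, 10000);          // small segment, term is common:   ~3.4
        float idfB = idf(900, 1000000);        // large segment, term is rarer:    ~8.0
        float idfGlobal = idf(1800, 1010000);  // both segments seen together:     ~7.3
        System.out.println("local A=" + idfA + " local B=" + idfB + " global=" + idfGlobal);
      }
    }

A hit weighted with idf ~3.4 on one Server and a hit weighted with idf ~8.0 on another end up sorted on the same absolute scale by the Client, even though the two weights are not comparable.
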
> Currently the workaround is to deploy the same number of documents in segments per Server, and to ensure that segments contain well-randomized content so that term frequencies for common terms are very similar.
> The solution proposed here (as a result of discussion between ab and cutting; patches are coming) is to calculate global IDFs prior to running the query, and to pre-boost query Terms with these global IDFs. This will require one more RPC call per query (this can be optimized later, e.g. through caching). The scores will then be normalized according to the global IDFs, and the Client will be able to compare them meaningfully. Scores will also become independent of the segment content or the local number of documents per Server. This will involve at least the following changes:
> * change NutchSimilarity.idf(Term, Searcher) to always return 1.0f. This enables us to manipulate scores independently of local IDFs.
> * add a new method to the Searcher interface, int[] getDocFreqs(Term[]), which will return document frequencies for query terms.
> * modify getSegmentNames() so that it also returns the total number of documents in each segment, or implement this as a separate method (this will be called once during segment init).
> * in DistributedSearch$Client.search(), first make a call to the Servers to return local document frequencies for the current query, and calculate global IDFs for each relevant Term in that query.
> * multiply TermQuery boosts by idf(totalDocFreq, totalIndexedDocs), and PhraseQuery boosts by the sum of idf(totalDocFreq, totalIndexedDocs) over all of their terms.
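
To make the flow in the list above concrete, here is a minimal sketch of the Client side, assuming hypothetical names (RemoteSearcher, getDocFreqs, getNumDocs) and the classic Lucene idf formula; it illustrates the idea only and need not match the attached distributed-idf patches or the Lucene version in use:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class GlobalIdfSketch {

      /** Minimal view of a remote search Server, assumed for this sketch. */
      interface RemoteSearcher {
        int[] getDocFreqs(Term[] terms); // proposed addition to the Searcher interface
        int getNumDocs();                // total documents indexed on this Server
      }

      static float idf(int docFreq, int numDocs) {
        return (float) (Math.log((double) numDocs / (docFreq + 1)) + 1.0);
      }

      /** One extra RPC round-trip: sum doc freqs and doc counts across all Servers. */
      static float[] globalIdfs(RemoteSearcher[] servers, Term[] terms) {
        int[] totalDocFreqs = new int[terms.length];
        int totalDocs = 0;
        for (RemoteSearcher server : servers) {
          int[] local = server.getDocFreqs(terms);
          for (int i = 0; i < terms.length; i++) {
            totalDocFreqs[i] += local[i];
          }
          totalDocs += server.getNumDocs();
        }
        float[] idfs = new float[terms.length];
        for (int i = 0; i < terms.length; i++) {
          idfs[i] = idf(totalDocFreqs[i], totalDocs);
        }
        return idfs;
      }

      /** Pre-boost TermQuery clauses of a flat BooleanQuery with the global IDFs. */
      static void preBoost(BooleanQuery query, Term[] terms, float[] idfs) {
        for (BooleanClause clause : query.getClauses()) {
          Query q = clause.getQuery();
          if (q instanceof TermQuery) {
            Term t = ((TermQuery) q).getTerm();
            for (int i = 0; i < terms.length; i++) {
              if (terms[i].equals(t)) {
                q.setBoost(q.getBoost() * idfs[i]);
              }
            }
          }
          // A PhraseQuery clause would be boosted by the sum of its terms' global IDFs.
        }
      }
    }

With NutchSimilarity.idf() forced to 1.0f on the Servers, the boost set here is the only place IDF enters the score, so hits coming back from different Servers are on a comparable scale before the Client merges and sorts them.
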
> This solution should be applicable with only minor changes to all branches, but initially the patches will be relative to trunk/.
> Comments, suggestions and review are welcome!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

