lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-329) Fuzzy query scoring issues
Date Thu, 27 Jan 2011 15:23:30 GMT


Robert Muir commented on LUCENE-329:

Mark, I tend to agree, but at the same time I think you can safely implement
a RewriteMethod to do whatever you want? (e.g. apply the logic of FuzzyLikeThis)

Doing something special with IDF is really specific to certain Similarities. For example,
your Similarity might not use the traditional IDF at all, but something involving
totalTermFreq and sumOfTotalTermFreq (like language modelling).
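
To make that point concrete, here is a minimal sketch of a language-model style term score built only from totalTermFreq and sumOfTotalTermFreq, with no docFreq-based IDF anywhere. It assumes plain Dirichlet-smoothed query likelihood; the class name, the method signature and the value of mu are made up for illustration and are not Lucene's Similarity API.

```java
/** Sketch: Dirichlet-smoothed language-model score for one term, with no IDF factor. */
final class LmScoreSketch {

    /**
     * Returns log P(term | document) under Dirichlet smoothing.
     *
     * @param tf                 occurrences of the term in this document
     * @param docLength          number of tokens in this document
     * @param totalTermFreq      occurrences of the term across the whole collection
     * @param sumOfTotalTermFreq total number of tokens in the collection
     * @param mu                 smoothing parameter (assumed value, chosen by the caller)
     */
    static double score(long tf, long docLength, long totalTermFreq,
                        long sumOfTotalTermFreq, double mu) {
        // Collection model: how likely the term is anywhere in the index.
        double collectionProb = (double) totalTermFreq / sumOfTotalTermFreq;
        // Smoothed document model: document evidence mixed with the collection model.
        double docProb = (tf + mu * collectionProb) / (docLength + mu);
        return Math.log(docProb);
    }

    public static void main(String[] args) {
        // Same tf and document length; only the collection frequency differs.
        double rare = score(1, 100, 10, 1_000_000, 2000.0);
        double common = score(1, 100, 10_000, 1_000_000, 2000.0);
        System.out.println("rare term:   " + rare);
        System.out.println("common term: " + common);
    }
}
```

With equal tf and document length, this particular formula gives the term that is more frequent in the collection the higher smoothed probability, so a tweak written in terms of IDF has nothing to attach to here.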

So I am concerned about doing tricky things with the scoring system by default
for this query, though we do provide the simple options in core (Scoring, BoostOnly, etc.).

An idea would be to factor the logic out of FuzzyLikeThisQuery into a FuzzyLikeThisRewriteMethod,
so you could just call .setRewriteMethod on your fuzzy query and use it.
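
Roughly, the shape I have in mind looks like the sketch below. Every type in it is a hypothetical stand-in (RewriteStrategy, FuzzyQuerySketch and setRewriteStrategy are not Lucene classes or methods), and the two strategies only imitate the spirit of the Scoring and BoostOnly options. A FuzzyLikeThis-style strategy would simply be one more implementation of the same interface.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of a pluggable rewrite step; all types here are hypothetical, not Lucene classes. */
final class RewriteStrategySketch {

    /** Decides the weight each expanded term contributes to the rewritten query. */
    interface RewriteStrategy {
        Map<String, Double> weigh(Map<String, Double> boostByTerm,
                                  Map<String, Integer> docFreqByTerm, int numDocs);
    }

    /** "Scoring"-style: each term's boost is multiplied by that term's own IDF. */
    static final RewriteStrategy SCORING = (boosts, docFreqs, numDocs) -> {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : boosts.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            out.put(e.getKey(), e.getValue() * (1.0 + Math.log(numDocs / (df + 1.0))));
        }
        return out;
    };

    /** "BoostOnly"-style: index statistics are ignored; only the boost counts. */
    static final RewriteStrategy BOOST_ONLY =
            (boosts, docFreqs, numDocs) -> new LinkedHashMap<>(boosts);

    /** Minimal stand-in for a fuzzy query that holds its expansion and a swappable strategy. */
    static final class FuzzyQuerySketch {
        private final Map<String, Double> boostByTerm;
        private RewriteStrategy strategy = SCORING;

        FuzzyQuerySketch(Map<String, Double> boostByTerm) {
            this.boostByTerm = boostByTerm;
        }

        void setRewriteStrategy(RewriteStrategy strategy) {
            this.strategy = strategy;
        }

        Map<String, Double> rewrite(Map<String, Integer> docFreqByTerm, int numDocs) {
            return strategy.weigh(boostByTerm, docFreqByTerm, numDocs);
        }
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("lucene", 1.0); // exact match
        boosts.put("lucent", 0.8); // one edit away (boost value is made up)
        Map<String, Integer> docFreqs = new LinkedHashMap<>();
        docFreqs.put("lucene", 999);
        docFreqs.put("lucent", 9);

        FuzzyQuerySketch query = new FuzzyQuerySketch(boosts);
        System.out.println("scoring:    " + query.rewrite(docFreqs, 1000));
        query.setRewriteStrategy(BOOST_ONLY);
        System.out.println("boost only: " + query.rewrite(docFreqs, 1000));
    }
}
```

The point of the design is that the query itself stays unchanged: whoever wants unusual scoring swaps the strategy, and the default stays simple.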

> Fuzzy query scoring issues
> --------------------------
>                 Key: LUCENE-329
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.2rc5
>         Environment: Operating System: All
> Platform: All
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.1, 4.0
>         Attachments: patch.txt
> Queries which automatically produce multiple terms (wildcard, range, prefix, 
> fuzzy etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than term queries 
> because of the volume of terms introduced (A match on query Foo~ is 0.1 
> whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over the more common
> forms because of the IDF. When using fuzzy queries, for example, rare
> misspellings typically appear in results before the more common correct spellings.
> I will attach a patch that corrects the issues identified above by:
> 1) Overriding Similarity.coord to counteract the downplaying of scores 
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the 
> basis of scoring all other expanded terms.
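
The two problems and the two proposed fixes in the quoted description can be illustrated with a little arithmetic. The sketch below assumes the classic TF-IDF style formulas, idf = 1 + ln(numDocs / (docFreq + 1)) and coord = overlap / maxOverlap. It does not reproduce the attached patch, and the names flatCoord and sharedIdf as well as the document counts are invented for the example.

```java
/** Sketch of the two scoring fixes, using classic TF-IDF style formulas. */
final class FuzzyScoringSketch {

    /** Classic IDF: rarer terms (lower docFreq) get a larger value. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (docFreq + 1.0));
    }

    /** Classic coord: fraction of the query's terms that the document matched. */
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    /** Fix 1: a coord that does not penalise documents for unmatched expanded terms. */
    static double flatCoord(int overlap, int maxOverlap) {
        return 1.0;
    }

    /** Fix 2: the IDF of the most common expanded form, shared by every variant. */
    static double sharedIdf(int[] docFreqs, int numDocs) {
        int maxDocFreq = 0;
        for (int df : docFreqs) {
            maxDocFreq = Math.max(maxDocFreq, df);
        }
        return idf(maxDocFreq, numDocs);
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        int commonDf = 499; // e.g. the correct spelling
        int rareDf = 1;     // e.g. a rare misspelling

        // Problem 1: a document matching 1 of 10 expanded terms is scaled down by coord.
        System.out.println("coord, 1 of 10 terms: " + coord(1, 10));
        System.out.println("flat coord:           " + flatCoord(1, 10));

        // Problem 2: per-term IDF rewards the rare misspelling.
        System.out.println("idf(common spelling): " + idf(commonDf, numDocs));
        System.out.println("idf(rare misspelling): " + idf(rareDf, numDocs));

        // With a shared IDF, every variant is weighted like the most common form.
        System.out.println("shared idf:           " + sharedIdf(new int[] {commonDf, rareDf}, numDocs));
    }
}
```

Note that both fixes lean on quantities (coord, docFreq-based IDF) that only exist in this style of Similarity.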

