lucene-dev mailing list archives

From "Mark Harwood (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3381) Sandbox remaining contrib queries
Date Thu, 18 Aug 2011 10:32:27 GMT


Mark Harwood commented on LUCENE-3381:

It's more nuanced than averaging IDF of variants (as discussed at length in LUCENE-329).
To summarise: the original search term is the closest thing we have to the user's intent.
If we average its IDF with those of all the fuzzy variants, we will most likely dilute this
value with a set of rare terms (most terms in the termEnum are, e.g., typos) that happen to share some
When this sort of expanded fuzzy query sits alongside other search terms in a BooleanQuery
(e.g. robert~ OR muir), we end up making the "robert~" branch of the query look rare
compared to the straight "muir" term, thanks to the IDF dilution from a hundred rare "robert"
variations. In my view the correct fix is to use the root term's IDF only (assuming "robert"
actually exists in the index; otherwise we must drop back to the average of variants).
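
To make the dilution concrete, here is a minimal sketch, not the FuzzyLikeThis code itself. It assumes the classic Lucene-style IDF formula, 1 + ln(numDocs / (docFreq + 1)), and uses made-up document frequencies for "robert", "muir", and a hundred rare variants. The class and method names (FuzzyIdfSketch, averagedIdf, rootOrFallbackIdf) are hypothetical and exist only for illustration.

```java
import java.util.Arrays;

/** Illustrative sketch only; names and numbers are assumptions, not Lucene code. */
final class FuzzyIdfSketch {

    /** Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Averages the IDF of the root term together with all of its fuzzy variants. */
    static double averagedIdf(long rootDocFreq, long[] variantDocFreqs, long numDocs) {
        double sum = idf(rootDocFreq, numDocs);
        for (long df : variantDocFreqs) {
            sum += idf(df, numDocs);
        }
        return sum / (variantDocFreqs.length + 1);
    }

    /** Uses the root term's IDF if the root is indexed, else the average over the variants. */
    static double rootOrFallbackIdf(long rootDocFreq, long[] variantDocFreqs, long numDocs) {
        if (rootDocFreq > 0 || variantDocFreqs.length == 0) {
            return idf(rootDocFreq, numDocs);
        }
        double sum = 0.0;
        for (long df : variantDocFreqs) {
            sum += idf(df, numDocs);
        }
        return sum / variantDocFreqs.length;
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000L;       // assumed index size
        long robertDocFreq = 50_000L;    // assumed: "robert" is a common term
        long muirDocFreq = 2_000L;       // assumed: "muir" is much rarer
        long[] variants = new long[100]; // a hundred typo-like variants of "robert"
        Arrays.fill(variants, 3L);       // assumed: each appears in only 3 documents

        System.out.println("idf(muir)            = " + idf(muirDocFreq, numDocs));
        System.out.println("robert~ averaged IDF = " + averagedIdf(robertDocFreq, variants, numDocs));
        System.out.println("robert~ root IDF     = " + rootOrFallbackIdf(robertDocFreq, variants, numDocs));
    }
}
```

With these assumed frequencies the averaged IDF for the "robert~" branch comes out higher than the IDF of "muir", even though "robert" itself is far more common than "muir"; using the root term's IDF restores the expected ordering.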

That's the trick employed by FuzzyLikeThis that stops my customers complaining about "bad
fuzzy matches".

> Sandbox remaining contrib queries
> ---------------------------------
>                 Key: LUCENE-3381
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Chris Male
>         Attachments: LUCENE-3381.patch
> In LUCENE-3271, I moved the 'good' queries from the queries contrib to new destinations
(primarily the queries module).  The remnants now need to find their home.  As suggested in
LUCENE-3271, these classes are not bad per se, just odd.  So let's create a sandbox contrib
that they and other 'odd' contrib classes can go to.  We can then decide their fate at another

This message is automatically generated by JIRA.