lucene-dev mailing list archives

From "Doron Cohen (JIRA)" <>
Subject [jira] Updated: (LUCENE-736) Sloppy Phrase Scoring Misbehavior
Date Fri, 01 Dec 2006 10:26:22 GMT

Doron Cohen updated LUCENE-736:

    Attachment: sloppy_phrase_java.patch.txt

The attached sloppy_phrase_java.patch.txt fixes the failing new tests.
It also includes the fix for the skipTo() bug from issue 697.

The fix does not guarantee that document A B C B A would score "A B C"~4 and "C B A"~4 the same.
It does guarantee that for "B C"~2 and "C B"~2.
This is because a general fix for that (at least the one that I devised) would be too expensive.
Although this is an interesting case, I'd like to think it is not an important one.

This fix comes with a performance cost: about 15% degradation in CPU activity of sloppy phrase
scoring, as the attached perf logs show.
Here is the summary of these tests:


I think that in a real-life scenario (real index, real documents, real queries) this extra
CPU cost will be overshadowed by I/O, but I also believe we should refrain from slowing down
search. So, unhappy with this degradation (anyone would be :-), I would look for other ways
to fix this. Ideas are welcome.

The perf test was done using the task benchmark framework (see issue 675). The logs also show
the queries that were searched.

All tests pass with new code.

> Sloppy Phrase Scoring Misbehavior
> ---------------------------------
>                 Key: LUCENE-736
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>            Reporter: Doron Cohen
>         Assigned To: Doron Cohen
>            Priority: Minor
>         Attachments: perf-search-new.log, perf-search-orig.log, sloppy_phrase_java.patch.txt,
> This is an extension of
> In addition to the abnormalities Yonik pointed out in 697, there seem to be other issues
> with sloppy phrase search and scoring.
> 1) A phrase with a repeated word would be detected in a document although it is not there.
> I.e. document = A B D C E , query = "B C B" would not find this document (as expected),
> but query "B C B"~2 would find it.
> I think that no matter how large the slop is, this document should not be a match.
> 2) A document containing both orders of a query, symmetrically, would score differently
> for the query and for its reversed form.
> I.e. document = A B C B A would score differently for queries "B C"~2 and "C B"~2, although
> it is symmetric to both.
> I will attach test cases that show both these problems and the one reported by Yonik
> in 697.
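
The two expectations described in the issue can be illustrated with a small brute-force model. This is only a sketch under stated assumptions, not Lucene's scorer and not the attached patch: the class SloppyPhraseSketch and its methods are hypothetical names, and the "distance" of an alignment is assumed to be the spread (max minus min) of docPosition minus queryOffset over all query terms. With distinctPositions=false a single document position may serve two query terms, which is one way a repeated word could appear to match. With distinctPositions=true every query term needs its own document position.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model for illustration only; all names here are hypothetical.
class SloppyPhraseSketch {

    // Returns the sorted distances of every way to align the query terms
    // with matching document positions.
    static List<Integer> alignmentDistances(String[] doc, String[] query,
                                            boolean distinctPositions) {
        List<Integer> out = new ArrayList<>();
        if (query.length == 0) {
            return out; // nothing to align
        }
        collect(doc, query, distinctPositions, 0, new int[query.length], out);
        Collections.sort(out);
        return out;
    }

    // Recursively picks a document position for query term i.
    private static void collect(String[] doc, String[] query, boolean distinct,
                                int i, int[] chosen, List<Integer> out) {
        if (i == query.length) {
            int min = Integer.MAX_VALUE;
            int max = Integer.MIN_VALUE;
            for (int k = 0; k < chosen.length; k++) {
                int shift = chosen[k] - k; // doc position minus query offset
                min = Math.min(min, shift);
                max = Math.max(max, shift);
            }
            out.add(max - min);
            return;
        }
        for (int p = 0; p < doc.length; p++) {
            if (!doc[p].equals(query[i])) {
                continue;
            }
            if (distinct && alreadyUsed(chosen, i, p)) {
                continue; // this document position already serves an earlier term
            }
            chosen[i] = p;
            collect(doc, query, distinct, i + 1, chosen, out);
        }
    }

    private static boolean alreadyUsed(int[] chosen, int upTo, int p) {
        for (int k = 0; k < upTo; k++) {
            if (chosen[k] == p) {
                return true;
            }
        }
        return false;
    }

    // A document matches if at least one alignment is within the slop.
    static boolean matches(String[] doc, String[] query, int slop,
                           boolean distinctPositions) {
        for (int d : alignmentDistances(doc, query, distinctPositions)) {
            if (d <= slop) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] doc1 = "A B D C E".split(" ");
        String[] bcb = "B C B".split(" ");
        System.out.println(matches(doc1, bcb, 10, false)); // true: the single B is reused
        System.out.println(matches(doc1, bcb, 10, true));  // false: only one B in the document

        String[] doc2 = "A B C B A".split(" ");
        System.out.println(alignmentDistances(doc2, "A B C".split(" "), true)); // [0, 2, 4, 4]
        System.out.println(alignmentDistances(doc2, "C B A".split(" "), true)); // [0, 2, 4, 4]
    }
}
```

In this model the symmetric document yields the same set of distances for a query and its reverse, which is the behavior item 2 asks for. The enumeration is exponential in the number of query terms, so it is useful as a reference for tests rather than as a scoring strategy.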
