lucene-dev mailing list archives

From "David Smiley (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-8121) UnifiedHighlighter can highlight terms within SpanNear clauses at unmatched positions
Date Tue, 09 Jan 2018 22:01:00 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated LUCENE-8121:
---------------------------------
    Attachment: LUCENE-2287_UH_SpanCollector.patch

* Added Passage.toString, useful for debugging and in tests
* Rewrote a large chunk of my last patch in PhraseHelper.  I want to prevent the same term
in different SpanQueries from yielding two OffsetsEnums for that term with different freqs.
 I could get into the nitty-gritty, but anyone who is curious can just read the (commented)
patch.  I removed the two methods I had taken from Luwak since this refactoring didn't mesh
with the API contract.
* I resolved the nocommits related to offset storage, principally by having the value side
of the map be the SpanCollectedOffsetsEnum.  I modified it a bit so that it is no longer
immutable: the collector adds to it, and after that it isn't modified.  I use
postingsEnum.freq() to size the int arrays, so no resizing is needed.  I'm really happy with
that versus some other things I tried.  In the future it shouldn't be hard to add payload support.
* The patch has a bunch of changes to TestUnifiedHighlighter and TestUnifiedHighlighterMTQ
which are improvements to test randomization and not strictly for this patch.
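
A minimal sketch of the offset-storage idea from the third bullet, in plain Java with no
Lucene dependency. The class name CollectedOffsets and its methods are hypothetical and are
not taken from the patch; the only detail carried over is that the int arrays are sized once
from the term's freq (what postingsEnum.freq() would report) and filled during collection,
so they never need to grow.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a per-term offsets holder; not code from the patch. */
final class CollectedOffsets {
    private final int[] startOffsets;
    private final int[] endOffsets;
    private int size = 0; // number of occurrences actually collected so far

    /** termFreq is an upper bound on how many occurrences can be collected. */
    CollectedOffsets(int termFreq) {
        this.startOffsets = new int[termFreq];
        this.endOffsets = new int[termFreq];
    }

    /** Called by the collector for each matching occurrence; the arrays never grow. */
    void add(int startOffset, int endOffset) {
        startOffsets[size] = startOffset;
        endOffsets[size] = endOffset;
        size++;
    }

    /** The accurate freq: occurrences that matched, which may be fewer than termFreq. */
    int freq() {
        return size;
    }

    int[] starts() {
        return Arrays.copyOf(startOffsets, size);
    }

    int[] ends() {
        return Arrays.copyOf(endOffsets, size);
    }
}
```

In this sketch a term with freq 3 that matches at only two positions yields freq() == 2,
which mirrors the "accurate freq" behavior described in the note below.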

Note that this change will cause passage scores that involve position-sensitive queries to
be a little different.  The old methodology wrapped the PostingsEnum for each position-sensitive
term in a Spans and used the freq of the underlying term (even if we'd match this term fewer
than freq times due to position sensitivity).  Now the freq for position-sensitive terms is
accurate -- usually smaller, which will amount to higher scores for passages.

I think it's ready and I'll commit in a day or two.

> UnifiedHighlighter can highlight terms within SpanNear clauses at unmatched positions
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8121
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8121
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Minor
>             Fix For: 7.3
>
>         Attachments: LUCENE-2287_UH_SpanCollector.patch, LUCENE-2287_UH_SpanCollector.patch
>
>
> The UnifiedHighlighter (and original Highlighter) highlight phrases by converting to
a SpanQuery and using the Spans' start and end positions to assume that every occurrence of
the underlying terms between those positions is to be highlighted.  But this is inaccurate;
see LUCENE-5455 for a good example, and also LUCENE-2287.  The solution is to use the SpanCollector
API which was introduced after the phrase matching aspects of those highlighters were developed.
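
A minimal sketch of the difference described above, in plain Java with no Lucene dependency.
Everything here is hypothetical: the class SpanHighlightSketch, its method names, and the
hand-supplied span range and leaf positions are illustrative assumptions, not Lucene APIs
and not output from real span matching. The span end is treated as exclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; the span and leaf positions are supplied by hand. */
final class SpanHighlightSketch {

    /** Range-based approach: highlight every query term found in [spanStart, spanEnd). */
    static List<Integer> highlightByRange(
            String[] tokens, Set<String> queryTerms, int spanStart, int spanEnd) {
        List<Integer> positions = new ArrayList<>();
        for (int pos = spanStart; pos < spanEnd; pos++) {
            if (queryTerms.contains(tokens[pos])) {
                positions.add(pos);
            }
        }
        return positions;
    }

    /** Collector-based approach: highlight only the positions the match reported. */
    static List<Integer> highlightByCollectedLeaves(int[] collectedPositions) {
        List<Integer> positions = new ArrayList<>();
        for (int pos : collectedPositions) {
            positions.add(pos);
        }
        return positions;
    }
}
```

For the tokens "a x a b" with query terms {a, b}, a span covering positions 0 to 4 whose
matching leaves are at positions 0 and 3 gives [0, 2, 3] from the range-based method and
[0, 3] from the collector-based one. The extra "a" at position 2 is the kind of unmatched
position this issue is about.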




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

