lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky
Date Sat, 27 Jun 2015 23:19:04 GMT


Michael McCandless commented on LUCENE-6595:

bq. And do you agree this issue is the same as LUCENE-5734?

This looks like the same issue to me, although since HTMLStripCharFilter "knows" it's replacing
HTML entities (I think?), it could be smarter about correcting offsets, unlike e.g. MappingCharFilter,
which must stay generic/agnostic about what exactly it's remapping.

My first idea was the same one proposed on LUCENE-5734: add a new correctEndOffset method
that defaults to {{correctOffset(endOffset-1)+1}}, but that "fails" the {{cccc -> cc}} example below.

[~caomanhdat]'s approach here is to store another int per correction: the input offset
where the correction first applied. It's a neat solution: it seems to solve both of my examples,
and I think it would solve LUCENE-5734 as well?  Any HTML entity that maps to the empty string (e.g.
<em>, </em>, <b>, etc., I think?) would not be included within the output
token's start/endOffset, unless that entity was "inside" a token.
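For concreteness, here is a tiny stand-alone model of the cumulative-diff correction and the {{correctOffset(endOffset-1)+1}} default, run against the two mappings from the issue description. The TreeMap encoding is my simplification for illustration, not Lucene's actual compact encoding:

```java
import java.util.TreeMap;

// Simplified model of CharFilter offset correction. A TreeMap of
// (output offset -> cumulative input-minus-output difference) stands in
// for Lucene's more compact internal encoding (an assumption).
public class CorrectEndOffsetSketch {

    // correct(o): add the cumulative diff recorded at the largest
    // output offset <= o (0 if no correction applies yet).
    static int correct(TreeMap<Integer, Integer> diffs, int outputOffset) {
        var entry = diffs.floorEntry(outputOffset);
        return outputOffset + (entry == null ? 0 : entry.getValue());
    }

    // The proposed correctEndOffset default: correctOffset(endOffset - 1) + 1.
    static int correctEnd(TreeMap<Integer, Integer> diffs, int endOffset) {
        return correct(diffs, endOffset - 1) + 1;
    }

    public static void main(String[] args) {
        // "( -> " and ") -> " applied to "(F31)"; token F31 spans output offsets (0, 3).
        TreeMap<Integer, Integer> parens = new TreeMap<>();
        parens.put(0, 1);  // "(" erased: offsets shift right by 1 from output 0
        parens.put(3, 2);  // ")" erased: cumulative shift becomes 2 from output 3
        System.out.println(correct(parens, 3));     // 5: the bad endOffset today
        System.out.println(correctEnd(parens, 3));  // 4: the proposed default works here

        // "cccc -> cc" applied to "cccc"; token cc spans output offsets (0, 2).
        TreeMap<Integer, Integer> cs = new TreeMap<>();
        cs.put(2, 2);  // 2 input chars swallowed by the time output offset 2 is reached
        System.out.println(correct(cs, 2));     // 4: correct today
        System.out.println(correctEnd(cs, 2));  // 2: the proposed default breaks this case
    }
}
```

The two println pairs reproduce the trade-off discussed above: the same default that fixes the erased-paren case regresses the {{cccc -> cc}} case.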

> CharFilter offsets correction is wonky
> --------------------------------------
>                 Key: LUCENE-6595
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Michael McCandless
>         Attachments: LUCENE-6595.patch, LUCENE-6595.patch
> Spinoff from this original Elasticsearch issue:
> If I make a MappingCharFilter with these mappings:
> {noformat}
>   ( -> 
>   ) -> 
> {noformat}
> i.e., just erase the left and right parens; tokenizing the string
> "(F31)" with e.g. WhitespaceTokenizer then produces a single token, F31,
> with start offset 1 (good).
> But for its end offset I would expect/want 4, whereas today it
> produces 5.
> This can be easily explained given how the mapping works: each time a
> mapping rule matches, we update the cumulative offset difference,
> conceptually as an array like this (it's encoded more compactly):
> {noformat}
>   Output offset: 0 1 2 3
>    Input offset: 1 2 3 5
> {noformat}
> When the tokenizer produces F31, it assigns it startOffset=0 and
> endOffset=3 based on the characters it sees (F, 3, 1).  It then asks
> the CharFilter to correct those offsets, mapping them backwards
> through the above arrays, which creates startOffset=1 (good) and
> endOffset=5 (bad).
> At first, to fix this, I thought this was an "off-by-1": when
> correcting the endOffset we really should return
> 1+correct(outputEndOffset-1), which would return the correct value (4)
> here.
> But that's too naive, e.g. here's another example:
> {noformat}
>   cccc -> cc
> {noformat}
> If I then tokenize cccc, today we produce the correct offsets (0, 4)
> but if we do this "off-by-1" fix for endOffset, we would get the wrong
> endOffset (2).
> I'm not sure what to do here...
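The Output-offset/Input-offset table in the description can be run as a small stand-alone sketch. The TreeMap here is my stand-in for illustration; the real encoding is more compact, as the description notes:

```java
import java.util.TreeMap;

// Sketch of the "Output offset -> Input offset" table from the issue
// description, modeled as a TreeMap of cumulative diffs (an assumption;
// not Lucene's actual encoding).
public class OffsetTableSketch {

    // correct(o): add the cumulative diff recorded at the largest
    // output offset <= o (0 if none).
    static int correct(TreeMap<Integer, Integer> diffs, int outputOffset) {
        var entry = diffs.floorEntry(outputOffset);
        return outputOffset + (entry == null ? 0 : entry.getValue());
    }

    public static void main(String[] args) {
        // "(F31)" with "(" and ")" erased: diff 1 applies from output
        // offset 0, diff 2 from output offset 3, matching the table
        //   Output offset: 0 1 2 3
        //    Input offset: 1 2 3 5
        TreeMap<Integer, Integer> diffs = new TreeMap<>();
        diffs.put(0, 1);
        diffs.put(3, 2);

        // WhitespaceTokenizer sees F31 at output offsets (0, 3):
        System.out.println(correct(diffs, 0));  // 1  (startOffset: good)
        System.out.println(correct(diffs, 3));  // 5  (endOffset: bad, want 4)
    }
}
```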

This message was sent by Atlassian JIRA
