lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-6595) CharFilter offsets correction is wonky
Date Sat, 20 Jun 2015 10:12:00 GMT
Michael McCandless created LUCENE-6595:
------------------------------------------

             Summary: CharFilter offsets correction is wonky
                 Key: LUCENE-6595
                 URL: https://issues.apache.org/jira/browse/LUCENE-6595
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Michael McCandless


Spinoff from this original Elasticsearch issue: https://github.com/elastic/elasticsearch/issues/11726

If I make a MappingCharFilter with these mappings:

{noformat}
  ( -> 
  ) -> 
{noformat}

i.e., just erase the left and right parens, then tokenizing the string
"(F31)" with e.g. WhitespaceTokenizer produces a single token, F31,
with start offset 1 (good).

For its end offset I would expect/want 4, but today it produces 5.

This can be easily explained given how the mapping works: each time a
mapping rule matches, we update the cumulative offset difference,
conceptually as an array like this (it's encoded more compactly):

{noformat}
  Output offset: 0 1 2 3
   Input offset: 1 2 3 5
{noformat}

When the tokenizer produces F31, it assigns it startOffset=0 and
endOffset=3 based on the characters it sees (F, 3, 1).  It then asks
the CharFilter to correct those offsets, mapping them backwards
through the above arrays, which yields startOffset=1 (good) and
endOffset=5 (bad).
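
To make the lookup concrete, here is a minimal, self-contained sketch of that correction step. It is not Lucene's code: the class and method names are hypothetical, and the TreeMap of (output offset, cumulative difference) pairs is only an assumed stand-in for the more compact encoding mentioned above.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; names are hypothetical, not Lucene's API.
class OffsetCorrectionSketch {
    // Key: output offset at which a correction starts to apply.
    // Value: cumulative (input - output) difference from that offset onward.
    private final TreeMap<Integer, Integer> diffs = new TreeMap<>();

    void addCorrection(int outputOffset, int cumulativeDiff) {
        diffs.put(outputOffset, cumulativeDiff);
    }

    // Map an offset in the filtered text back to an offset in the original text.
    int correct(int outputOffset) {
        Map.Entry<Integer, Integer> e = diffs.floorEntry(outputOffset);
        return e == null ? outputOffset : outputOffset + e.getValue();
    }

    // Corrections assumed to be recorded while filtering "(F31)" down to "F31".
    static OffsetCorrectionSketch forParenExample() {
        OffsetCorrectionSketch s = new OffsetCorrectionSketch();
        s.addCorrection(0, 1); // "(" erased before output offset 0
        s.addCorrection(3, 2); // ")" erased at output offset 3
        return s;
    }

    public static void main(String[] args) {
        OffsetCorrectionSketch s = forParenExample();
        System.out.println("startOffset=" + s.correct(0)); // startOffset=1
        System.out.println("endOffset=" + s.correct(3));   // endOffset=5
    }
}
```

The second correction takes effect exactly at output offset 3, which is also the token's end offset, so the erased ")" gets counted as part of the token.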

At first, to fix this, I thought this was an "off-by-1" and that, when
correcting the endOffset, we really should return
1+correct(outputEndOffset-1), which would return the correct value (4)
here.

But that's too naive, e.g. here's another example:

{noformat}
  cccc -> cc
{noformat}

If I then tokenize cccc, today we produce the correct offsets (0, 4),
but if we apply this "off-by-1" fix for endOffset, we would get the wrong
endOffset (2).
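
Both cases can be checked side by side with a small sketch. Again this is not Lucene's code: the names are hypothetical, and the correction entries (in particular, that the cccc -> cc rule records its difference at output offset 2) are assumptions chosen to be consistent with the offsets reported above.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; names are hypothetical, not Lucene's API.
class NaiveEndOffsetFixSketch {
    // Current behaviour: add the cumulative difference in effect at this output offset.
    static int correct(TreeMap<Integer, Integer> diffs, int outputOffset) {
        Map.Entry<Integer, Integer> e = diffs.floorEntry(outputOffset);
        return e == null ? outputOffset : outputOffset + e.getValue();
    }

    // The "off-by-1" idea: correct the last character's offset, then add 1.
    static int naiveEnd(TreeMap<Integer, Integer> diffs, int outputEndOffset) {
        return 1 + correct(diffs, outputEndOffset - 1);
    }

    // "(F31)" -> "F31": one char erased before offset 0, another at offset 3.
    static TreeMap<Integer, Integer> parenDiffs() {
        TreeMap<Integer, Integer> d = new TreeMap<>();
        d.put(0, 1);
        d.put(3, 2);
        return d;
    }

    // "cccc" -> "cc": assumed to record a difference of 2 after the replacement.
    static TreeMap<Integer, Integer> ccccDiffs() {
        TreeMap<Integer, Integer> d = new TreeMap<>();
        d.put(2, 2);
        return d;
    }

    public static void main(String[] args) {
        // Token F31 has output end offset 3; token cc has output end offset 2.
        System.out.println(correct(parenDiffs(), 3));  // 5 (today, bad)
        System.out.println(naiveEnd(parenDiffs(), 3)); // 4 (naive fix, good)
        System.out.println(correct(ccccDiffs(), 2));   // 4 (today, good)
        System.out.println(naiveEnd(ccccDiffs(), 2));  // 2 (naive fix, bad)
    }
}
```

Under these assumptions neither rule gives the wanted end offset in both cases.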

I'm not sure what to do here...



