lucene-dev mailing list archives

From "Cao Manh Dat (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-6595) CharFilter offsets correction is wonky
Date Mon, 22 Jun 2015 02:30:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595289#comment-14595289
] 

Cao Manh Dat edited comment on LUCENE-6595 at 6/22/15 2:29 AM:
---------------------------------------------------------------

The root of the problem is that we map N -> 1 and then ask for an inverse mapping 1 -> 1.
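
To make the N -> 1 point concrete, here is a minimal Python sketch (not Lucene code; the helper names forward_map and preimage are made up for illustration). It deletes the parentheses from the "(F31)" example in the issue description and lists every input offset that collapses onto a given output offset.

```python
def forward_map(text, drop):
    """Map each input offset (0..len(text)) to its output offset after
    deleting every character in `drop`. Hypothetical helper, not a Lucene API."""
    mapping = []
    out = 0
    for ch in text:
        mapping.append(out)
        if ch not in drop:
            out += 1
    mapping.append(out)  # offset just past the last character
    return mapping


def preimage(mapping, output_offset):
    """All input offsets that land on `output_offset` (the N in N -> 1)."""
    return [i for i, o in enumerate(mapping) if o == output_offset]


mapping = forward_map("(F31)", drop="()")
start_candidates = preimage(mapping, 0)  # where the token F31 may start
end_candidates = preimage(mapping, 3)    # where the token F31 may end
```

Both output offsets have two candidates here: [0, 1] for the start and [4, 5] for the end. The start offset wants the last candidate (1) and the end offset wants the first (4), so a single inverse function that always picks the same side cannot serve both.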

Currently CharFilter has two problems.
Problem 1:
{code}
Input :       A B C ) ) )
Output :      A B C
{code}
When the Tokenizer asks to correct offset 3 (the end of C in the output), that offset corresponds to
offsets 3, 4, 5 and 6 in the input. CharFilter corrects the offset of C to 6 (the end of the range), which is wrong here.

So why does cccc -> cc get the correct offset?
{code}
Input :     c c c c
Output :    c c
{code}
Because offset 2 (the end of the second c in the output) corresponds to offsets 2, 3 and 4 in the input.
CharFilter corrects offset 2 to 4 (the end of the range, which is correct).
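
The two cases can be reproduced with a small model of the cumulative-diff lookup. This is a simplified Python sketch consistent with the numbers above, not the actual Lucene implementation; OffsetMap, add and correct are names chosen for illustration, and the entry recorded for cccc -> cc (output offset 2, diff 2) is an assumption taken from the behaviour described in this comment.

```python
from bisect import bisect_right


class OffsetMap:
    """Toy model of cumulative offset correction (illustration only)."""

    def __init__(self):
        self.offsets = []  # output offsets where the cumulative diff changes
        self.diffs = []    # cumulative diff (input - output) from that offset on

    def add(self, output_offset, cumulative_diff):
        if self.offsets and self.offsets[-1] == output_offset:
            # Several replacements at the same output offset: keep the latest diff.
            self.diffs[-1] = cumulative_diff
        else:
            self.offsets.append(output_offset)
            self.diffs.append(cumulative_diff)

    def correct(self, output_offset):
        # Use the last entry whose output offset is <= the requested offset.
        i = bisect_right(self.offsets, output_offset) - 1
        return output_offset if i < 0 else output_offset + self.diffs[i]


# Ex1: "ABC)))" -> "ABC": three deletions, all at output offset 3.
ex1 = OffsetMap()
for removed in (1, 2, 3):
    ex1.add(3, removed)
ex1_end = ex1.correct(3)  # end offset of token ABC

# Ex2: "cccc" -> "cc": one replacement, diff recorded at output offset 2.
ex2 = OffsetMap()
ex2.add(2, 2)
ex2_end = ex2.correct(2)  # end offset of token cc
```

In this model ex1_end is 6, although ABC ends at 3 in the input, while ex2_end is 4, which is the real end of cccc. The lookup is identical in both cases; only the position of the replacement relative to the corrected offset differs.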

The difference between the two examples: in Ex1 the replacement happens right at the offset being
corrected (at 3), while in Ex2 the replacement happens before that offset (at 0). So I store an
inputOffsets[] array, which holds the start of each replacement.

Problem 2:
{code}
Input :   A <space> ( C
Output :  A <space> C
{code}
When the Tokenizer asks to correct offset 3 (which is C in the output), that offset corresponds to
offsets 3 and 4 in the input. CharFilter corrects the offset of C to 4 (the end of the range, which is correct).
But in this example the replacement also happens right at the offset being corrected. So correcting a
startOffset and correcting an endOffset are different operations, and I added a correctEndOffset method to
Tokenizer.
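
One way to read the two changes described above (an inputOffsets[] array plus a separate end-offset correction) is sketched below in Python. This is an interpretation, not the attached patch: the class OffsetMap, its methods, the helper delete_chars and the exact rule used in correct_end_offset are all assumptions made for illustration. The rule is that if a replacement starts in the input exactly where the token ends, the end offset stops at the start of that replacement; otherwise it falls back to the usual end-of-range correction.

```python
from bisect import bisect_right


class OffsetMap:
    """Toy offset-correction table; an illustration, not Lucene's class."""

    def __init__(self):
        self.offsets = []        # output offsets where the cumulative diff changes
        self.diffs = []          # cumulative diff (input - output) from that offset on
        self.input_offsets = []  # input offset where each replacement starts

    def add(self, output_offset, cumulative_diff, input_start):
        if self.offsets and self.offsets[-1] == output_offset:
            # Adjacent replacements: update the diff, keep the earliest input start.
            self.diffs[-1] = cumulative_diff
        else:
            self.offsets.append(output_offset)
            self.diffs.append(cumulative_diff)
            self.input_offsets.append(input_start)

    def correct_offset(self, output_offset):
        """Start-offset correction: end of the matching input range."""
        i = bisect_right(self.offsets, output_offset) - 1
        return output_offset if i < 0 else output_offset + self.diffs[i]

    def correct_end_offset(self, output_offset):
        """End-offset correction: stop before a replacement that starts
        exactly where the token ends in the input."""
        i = bisect_right(self.offsets, output_offset) - 1
        if i >= 0 and self.offsets[i] == output_offset:
            prev_diff = self.diffs[i - 1] if i > 0 else 0
            if self.input_offsets[i] == output_offset + prev_diff:
                return self.input_offsets[i]
        return self.correct_offset(output_offset)


def delete_chars(text, drop):
    """Build the table for a filter that deletes every character in `drop`."""
    table, kept, removed = OffsetMap(), 0, 0
    for pos, ch in enumerate(text):
        if ch in drop:
            removed += 1
            table.add(kept, removed, pos)
        else:
            kept += 1
    return table


f31 = delete_chars("(F31)", "()")
f31_offsets = (f31.correct_offset(0), f31.correct_end_offset(3))

abc = delete_chars("ABC)))", ")")
abc_end = abc.correct_end_offset(3)

cc = OffsetMap()
cc.add(2, 2, 0)  # "cccc" -> "cc": replacement starts at input offset 0
cc_end = cc.correct_end_offset(2)
```

Under these assumptions the token F31 gets offsets (1, 4), ABC ends at 3 and cc still ends at 4, which are the values the issue description asks for.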

[~dsmiley] I will look at LUCENE-5734 and try to fix that bug.


> CharFilter offsets correction is wonky
> --------------------------------------
>
>                 Key: LUCENE-6595
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6595
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Michael McCandless
>         Attachments: LUCENE-6595.patch
>
>
> Spinoff from this original Elasticsearch issue: https://github.com/elastic/elasticsearch/issues/11726
> If I make a MappingCharFilter with these mappings:
> {noformat}
>   ( -> 
>   ) -> 
> {noformat}
> i.e., just erase left and right paren, then tokenizing the string
> "(F31)" with e.g. WhitespaceTokenizer, produces a single token F31,
> with start offset 1 (good).
> But for its end offset, I would expect/want 4, but it produces 5
> today.
> This can be easily explained given how the mapping works: each time a
> mapping rule matches, we update the cumulative offset difference,
> conceptually as an array like this (it's encoded more compactly):
> {noformat}
>   Output offset: 0 1 2 3
>    Input offset: 1 2 3 5
> {noformat}
> When the tokenizer produces F31, it assigns it startOffset=0 and
> endOffset=3 based on the characters it sees (F, 3, 1).  It then asks
> the CharFilter to correct those offsets, mapping them backwards
> through the above arrays, which creates startOffset=1 (good) and
> endOffset=5 (bad).
> At first, to fix this, I thought this is an "off-by-1" and when
> correcting the endOffset we really should return
> 1+correct(outputEndOffset-1), which would return the correct value (4)
> here.
> But that's too naive, e.g. here's another example:
> {noformat}
>   cccc -> cc
> {noformat}
> If I then tokenize cccc, today we produce the correct offsets (0, 4)
> but if we do this "off-by-1" fix for endOffset, we would get the wrong
> endOffset (2).
> I'm not sure what to do here...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

