lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4955) NGramTokenFilter increments positions for each gram
Date Thu, 25 Apr 2013 08:30:16 GMT


Adrien Grand commented on LUCENE-4955:

Given that offsets can't go backwards and that tokens at the same position must have the same
start offset, I think the only way to get NGramTokenFilter out of TestRandomChains' exclusion
list (LUCENE-4641) is to fix position increments (this issue), change the order in which tokens
are emitted (LUCENE-3920), and stop modifying offsets. I know some people rely on the current
behavior, but I think it is more important to get this filter out of TestRandomChains' exclusions,
since the current behavior causes highlighting bugs and makes term vector files unnecessarily large.
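
To make the two constraints concrete, here is a rough sketch in plain Java. It does not use the Lucene API: the Tok class and all method names are hypothetical stand-ins for a token's term, position increment and offsets. The first generator assumes the behavior described in this issue and in LUCENE-3920 (grams emitted grouped by size, each with its own position and offsets); the second assumes the proposed behavior (all grams at the original token's position, with the token's offsets left untouched).

```java
import java.util.ArrayList;
import java.util.List;

final class NGramPositionSketch {

    /** Simplified stand-in for a token; this is NOT a Lucene class. */
    static final class Tok {
        final String term;
        final int posInc; // position increment relative to the previous token
        final int start;  // start offset in the original text
        final int end;    // end offset in the original text

        Tok(String term, int posInc, int start, int end) {
            this.term = term;
            this.posInc = posInc;
            this.start = start;
            this.end = end;
        }
    }

    /** Assumed old behavior: grams grouped by size, each with its own position and offsets. */
    static List<Tok> perGramPositions(String term, int tokenStart, int minGram, int maxGram) {
        List<Tok> out = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int i = 0; i + n <= term.length(); i++) {
                out.add(new Tok(term.substring(i, i + n), 1, tokenStart + i, tokenStart + i + n));
            }
        }
        return out;
    }

    /** Assumed proposed behavior: one position for all grams, offsets of the original token. */
    static List<Tok> sharedPosition(String term, int tokenStart, int minGram, int maxGram) {
        List<Tok> out = new ArrayList<>();
        int tokenEnd = tokenStart + term.length();
        for (int i = 0; i < term.length(); i++) {
            for (int n = minGram; n <= maxGram && i + n <= term.length(); n++) {
                int posInc = out.isEmpty() ? 1 : 0; // only the first gram advances the position
                out.add(new Tok(term.substring(i, i + n), posInc, tokenStart, tokenEnd));
            }
        }
        return out;
    }

    /** Checks the two constraints named in the comment above. */
    static boolean satisfiesInvariants(List<Tok> toks) {
        int lastStart = -1;
        for (Tok t : toks) {
            if (t.start < lastStart) {
                return false; // offsets went backwards
            }
            if (t.posInc == 0 && lastStart != -1 && t.start != lastStart) {
                return false; // same position but a different start offset
            }
            lastStart = t.start;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(satisfiesInvariants(perGramPositions("abc", 0, 1, 2)));
        System.out.println(satisfiesInvariants(sharedPosition("abc", 0, 1, 2)));
    }
}
```

For the term "abc" with gram sizes 1 to 2, the first stream fails the check because the 2-gram "ab" starts at offset 0 after the 1-gram "c" started at offset 2, while the second stream passes. Setting the increments to 0 while keeping per-gram offsets would still fail the same-start-offset constraint, which is why all three changes are needed together.
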
> NGramTokenFilter increments positions for each gram
> ---------------------------------------------------
>                 Key: LUCENE-4955
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.3
>            Reporter: Simon Willnauer
>             Fix For: 5.0, 4.4
>         Attachments: highlighter-test.patch, LUCENE-4955.patch
> NGramTokenFilter increments positions for each gram rather than for the actual token, which
> can lead to rather funny problems, especially with highlighting. Whether this filter should be
> used for highlighting is a different story, but today it seems to be common practice in many
> situations to highlight sub-term matches.
> I have a test for highlighting that uses ngram failing with a StringIndexOutOfBoundsException,
> since tokens are sorted by position, which causes offsets to be mixed up due to the ngram
> token filter.
