lucene-dev mailing list archives

From "Alan Woodward (JIRA)" <>
Subject [jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
Date Wed, 23 May 2018 08:15:00 GMT


Alan Woodward commented on LUCENE-8273:

The Elastic CI has found some reproducing seeds in TestRandomChains that fail as follows:
Suite: org.apache.lucene.analysis.core.TestRandomChains
01:47:39    [junit4]   2> Exception from random analyzer: 
01:47:39    [junit4]   2> charfilters=
01:47:39    [junit4]   2>   org.apache.lucene.analysis.fa.PersianCharFilter(
01:47:39    [junit4]   2>   org.apache.lucene.analysis.charfilter.MappingCharFilter(org.apache.lucene.analysis.charfilter.NormalizeCharMap@31483c67,
01:47:39    [junit4]   2> tokenizer=
01:47:39    [junit4]   2>   org.apache.lucene.analysis.core.UnicodeWhitespaceTokenizer(org.apache.lucene.util.AttributeFactory$1@27232fb3,
01:47:39    [junit4]   2> filters=ConditionalTokenFilter: 
01:47:39    [junit4]   2>   org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter(OneTimeWrapper@5f621e45

01:47:39    [junit4]   2>

01:47:39    [junit4]   2>   org.apache.lucene.analysis.MockRandomLookaheadTokenFilter(java.util.Random@4ced13ac,
OneTimeWrapper@7d30a80d term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1)
01:47:39    [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings
-Dtests.seed=72E157E8E16C0F79 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=en-US
-Dtests.timezone=America/Anguilla -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
01:47:39    [junit4] FAILURE 0.57s J0 | TestRandomChains.testRandomChainsWithLargeStrings
01:47:39    [junit4]    > Throwable #1: java.lang.AssertionError
01:47:39    [junit4]    > 	at __randomizedtesting.SeedInfo.seed([72E157E8E16C0F79:18BAE8F9B8222F8A]:0)
01:47:39    [junit4]    > 	at org.apache.lucene.analysis.LookaheadTokenFilter.peekToken(

The root cause is that LookaheadTokenFilter doesn't play well with ConditionalTokenFilter (CTF)
when the stream contains stacked tokens:
- CTF works by presenting the underlying TokenStream to its wrapped filter as a series of
snippets, demarcated by tokens that don't pass the {{shouldFilter()}} test.  When a new snippet
starts (i.e. when a token that passes {{shouldFilter()}} appears after one that doesn't),
{{reset()}} is called on the delegate, and when the snippet stops (i.e. when a token that
doesn't pass {{shouldFilter()}} appears), {{end()}} is called.
- This means that if we have stacked tokens, with the first not passing {{shouldFilter()}}
and the second passing it, the wrapped filter can see a TokenStream whose first token has a
position increment of 0.
- LookaheadTokenFilter has an explicit assertion that the first token it sees does not have a
posInc of 0, which matches the AssertionError from {{peekToken}} in the trace above.
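
To make the failure mode concrete, here is a minimal sketch in plain Java that does not use
the Lucene API. The Tok class, the snippets() helper and the "no hyphen" condition are all
hypothetical stand-ins, invented for illustration: they only model how a run of tokens that
pass the condition is handed to the delegate as one snippet, and what the first token of that
snippet looks like when the preceding stacked token was bypassed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class SnippetSketch {

    /** A token reduced to the two attributes that matter here (hypothetical model). */
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /**
     * Splits the stream into runs of consecutive tokens that pass shouldFilter.
     * Each run models what the wrapped filter sees between reset() and end().
     */
    static List<List<Tok>> snippets(List<Tok> stream, Predicate<Tok> shouldFilter) {
        List<List<Tok>> out = new ArrayList<>();
        List<Tok> current = null;
        for (Tok t : stream) {
            if (shouldFilter.test(t)) {
                if (current == null) {      // a snippet starts: delegate.reset()
                    current = new ArrayList<>();
                    out.add(current);
                }
                current.add(t);
            } else {
                current = null;             // the snippet stops: delegate.end()
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "wi-fi" and "wifi" are stacked: the second token has posInc 0.
        List<Tok> stream = Arrays.asList(
            new Tok("wi-fi", 1), new Tok("wifi", 0), new Tok("router", 1));
        // Assumed condition: only terms without a hyphen go to the delegate.
        List<List<Tok>> seen = snippets(stream, t -> !t.term.contains("-"));
        Tok first = seen.get(0).get(0);
        System.out.println(first.term + " posInc=" + first.posInc);
    }
}
```

In this model the delegate's only snippet begins with "wifi" at posInc 0, which is the
condition the assertion in LookaheadTokenFilter rejects.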

I think this can be fixed by applying a posInc adjustment when we're delegating: the
delegated snippet is made to start with a posInc of 1, and the CTF then adjusts the posInc
back down before the token is emitted.
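
The idea can be sketched in plain Java as follows. This is not the actual patch and does not
use the Lucene API: Tok, adjustment() and shiftFirst() are hypothetical names, and the
delegate is assumed to pass tokens through unchanged so that only the adjustment itself is
visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PosIncAdjustSketch {

    /** A token reduced to term and position increment (hypothetical model). */
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /** How much must be added so that the snippet's first token has posInc >= 1. */
    static int adjustment(List<Tok> snippet) {
        if (snippet.isEmpty()) {
            return 0;
        }
        return Math.max(0, 1 - snippet.get(0).posInc);
    }

    /** Returns a copy of toks with delta added to the first token's posInc. */
    static List<Tok> shiftFirst(List<Tok> toks, int delta) {
        List<Tok> out = new ArrayList<>(toks);
        if (!out.isEmpty()) {
            Tok first = out.get(0);
            out.set(0, new Tok(first.term, first.posInc + delta));
        }
        return out;
    }

    public static void main(String[] args) {
        // A snippet whose first token is stacked on a bypassed token.
        List<Tok> snippet = Arrays.asList(new Tok("wifi", 0), new Tok("router", 1));
        int adj = adjustment(snippet);
        // What the delegate sees: the snippet now starts with posInc 1.
        List<Tok> seenByDelegate = shiftFirst(snippet, adj);
        // What is emitted: the adjustment is undone (delegate assumed pass-through).
        List<Tok> emitted = shiftFirst(seenByDelegate, -adj);
        System.out.println("delegate sees posInc=" + seenByDelegate.get(0).posInc
            + ", emitted posInc=" + emitted.get(0).posInc);
    }
}
```

A real delegate may insert, drop or reorder tokens, so where exactly the downward adjustment
has to be applied is a detail this sketch leaves open.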

> Add a ConditionalTokenFilter
> ----------------------------
>                 Key: LUCENE-8273
>                 URL:
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: 7.4
>         Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, LUCENE-8273-part2-rebased.patch,
LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch,
LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch,
LUCENE-8273.patch, LUCENE-8273.patch
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter in such
a way that it could optionally be bypassed based on the current state of the TokenStream.
 This could be used to, for example, only apply WordDelimiterFilter to terms that contain
