lucene-java-user mailing list archives

From Michael McCandless <>
Subject Re: Offset bug in WordDelimiterFilter?
Date Tue, 06 Dec 2016 11:40:30 GMT
It looks like WDF strips the 's (STEM_ENGLISH_POSSESSIVE flag) but
doesn't reflect that in the end offset.

I'm not sure this is a bug, in that it seems OK to highlight the token
minus its attached English possessive?

It could be that it was originally by design?

E.g. you can see it here: ... scroll
down a bit and you'll see a Python's occurrence, with only Python highlighted.

But then, if you use the dedicated EnglishPossessiveFilter, it would
leave the offsets as you want (including the 's); so that's different.
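
To make the two behaviours concrete, here is a minimal plain-Java sketch. It
is not Lucene code: the class, the Tok type, and both method names are
hypothetical, and it only models the offset arithmetic described above, using
the numbers from the failing test quoted below (expected 21, got 19).

```java
// Illustration only: plain Java, no Lucene classes. All names are hypothetical.
public class PossessiveOffsets {

    /** A term plus the character span it covers in the original text. */
    public static final class Tok {
        public final String term;
        public final int startOffset;
        public final int endOffset;

        public Tok(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** True if the term ends in an English possessive 's (or 'S). */
    public static boolean hasPossessive(String term) {
        int n = term.length();
        return n >= 2
                && term.charAt(n - 2) == '\''
                && (term.charAt(n - 1) == 's' || term.charAt(n - 1) == 'S');
    }

    /** Strip the 's and shrink the end offset with it (what the failing test reports). */
    public static Tok stripAndShrinkOffset(Tok t) {
        if (!hasPossessive(t.term)) {
            return t;
        }
        String stem = t.term.substring(0, t.term.length() - 2);
        return new Tok(stem, t.startOffset, t.endOffset - 2);
    }

    /** Strip the 's but keep the original span, so the whole word is highlighted. */
    public static Tok stripAndKeepOffset(Tok t) {
        if (!hasPossessive(t.term)) {
            return t;
        }
        String stem = t.term.substring(0, t.term.length() - 2);
        return new Tok(stem, t.startOffset, t.endOffset);
    }

    public static void main(String[] args) {
        Tok in = new Tok("vaccinatieprogramma's", 0, 21);
        Tok shrunk = stripAndShrinkOffset(in);
        Tok kept = stripAndKeepOffset(in);
        // prints: vaccinatieprogramma 0-19
        System.out.println(shrunk.term + " " + shrunk.startOffset + "-" + shrunk.endOffset);
        // prints: vaccinatieprogramma 0-21
        System.out.println(kept.term + " " + kept.startOffset + "-" + kept.endOffset);
    }
}
```

With the first variant a highlighter would mark only the 19 characters before
the apostrophe; with the second it would mark all 21.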

Maybe open an issue for discussion about what the approach should be?

Mike McCandless

On Tue, Dec 6, 2016 at 6:27 AM, Markus Jelsma
<> wrote:
> Hello - I noticed something peculiar running Lucene/Solr 6.3.0.
> The plural vaccinatieprogramma's should have a startOffset of 0 and an endOffset of 21
> when passed through WordDelimiterFilter and/or stemmers, but it doesn't, slightly messing up
> highlighted terms.
>     wdf = new WordDelimiterFilter(
>         new CannedTokenStream(new Token("vaccinatieprogramma's", 0, 21)),
>         DEFAULT_WORD_DELIM_TABLE, flags, null);
>     assertTokenStreamContents(wdf,
>         new String[] { "vaccinatieprogramma"},
>         new int[] { 0 },
>         new int[] { 21 });
>    [junit4] Suite: org.apache.lucene.analysis.miscellaneous.TestWordDelimiterFilter
>    [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestWordDelimiterFilter -Dtests.method=testOffsets -Dtests.seed=21AB10650E10CEB9 -Dtests.slow=true -Dtests.locale=bg-BG -Dtests.timezone=Etc/GMT+10 -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
>    [junit4] FAILURE 0.06s | TestWordDelimiterFilter.testOffsets <<<
>    [junit4]    > Throwable #1: java.lang.AssertionError: endOffset 0 expected:<21> but was:<19>
> I would expect the same behaviour as stemmers: the length of the term is always the length
> of the original term. So if a user queries for a singular term, the whole plural (original)
> is highlighted.
> Am I missing something? Bug?
> Thanks,
> Markus
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
