lucene-java-user mailing list archives

From Alexey Makeev <>
Subject Too long token is not handled properly?
Date Fri, 11 Nov 2016 11:23:28 GMT

I'm using Lucene 6.2.0 and expect the following test to pass:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.standard.StandardTokenizer;


public class TestStandardTokenizer extends BaseTokenStreamTestCase {
    public void testLongToken() throws IOException {
        final StandardTokenizer tokenizer = new StandardTokenizer();
        final int maxTokenLength = tokenizer.getMaxTokenLength();

        // string with the following contents: "a" repeated maxTokenLength + 5 times, then " abc"
        final String longToken = new String(new char[maxTokenLength + 5]).replace("\0", "a") + " abc";

        tokenizer.setReader(new StringReader(longToken));
        assertTokenStreamContents(tokenizer, new String[]{"abc"});
        // actual contents: "a" repeated 255 times, then "aaaaa", then "abc"
    }
}

It seems that StandardTokenizer treats a completely filled buffer as a successfully
extracted token (1), and also emits the tail of the too-long token as a separate token (2).
Maybe (1) is debatable (I think it is a bug), but I am fairly sure (2) is a bug.
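For what it's worth, the output I observe can be reproduced with a plain-Java sketch (a hypothetical helper, not Lucene code) in which an over-long word is emitted in buffer-sized chunks rather than being dropped, which is what the tokenizer appears to be doing:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the observed behavior: the scanner's buffer holds at
// most maxTokenLength chars, so a longer run of letters comes out as several
// buffer-sized tokens instead of being skipped entirely.
public class LongTokenSketch {
    static List<String> tokenize(String input, int maxTokenLength) {
        List<String> tokens = new ArrayList<>();
        for (String word : input.split(" ")) {
            // Emit the word in chunks of at most maxTokenLength characters.
            for (int i = 0; i < word.length(); i += maxTokenLength) {
                tokens.add(word.substring(i, Math.min(word.length(), i + maxTokenLength)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        int maxTokenLength = 255; // default max token length in StandardTokenizer
        String longToken = new String(new char[maxTokenLength + 5]).replace("\0", "a") + " abc";
        List<String> tokens = tokenize(longToken, maxTokenLength);
        System.out.println(tokens.size());           // 3
        System.out.println(tokens.get(0).length());  // 255
        System.out.println(tokens.get(1));           // aaaaa
        System.out.println(tokens.get(2));           // abc
    }
}
```

This matches the actual stream contents from the test above ("a" x 255, "aaaaa", "abc"), whereas I expected the whole over-long word to be discarded, leaving only "abc".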

Best regards,
Alexey Makeev