lucene-java-user mailing list archives

From Alexey Makeev <makeev...@mail.ru.INVALID>
Subject Too long token is not handled properly?
Date Fri, 11 Nov 2016 11:23:28 GMT
Hello,

I'm using Lucene 6.2.0 and expect the following test to pass:

import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.standard.StandardTokenizer;

import java.io.IOException;
import java.io.StringReader;

public class TestStandardTokenizer extends BaseTokenStreamTestCase
{
    public void testLongToken() throws IOException
    {
        final StandardTokenizer tokenizer = new StandardTokenizer();
        final int maxTokenLength = tokenizer.getMaxTokenLength();

        // Build a string of 'a' repeated (maxTokenLength + 5) times, followed by " abc".
        final String longToken = new String(new char[maxTokenLength + 5]).replace("\0", "a") + " abc";

        tokenizer.setReader(new StringReader(longToken));
        
        // Expect the overlong token to be discarded entirely, leaving only "abc".
        assertTokenStreamContents(tokenizer, new String[]{"abc"});
        // Fails: the actual contents are "a" repeated 255 times, then "aaaaa", then "abc".
    }
}
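
For reference, a minimal dump loop shows what the tokenizer actually emits for this
input: a 255-character chunk, the 5-character tail "aaaaa", and "abc". This is just a
sketch using the usual Lucene 6.x TokenStream consumption pattern; the class name is mine:

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;
import java.io.StringReader;

public class DumpTokens
{
    public static void main(String[] args) throws IOException
    {
        final StandardTokenizer tokenizer = new StandardTokenizer();
        // Same input as in the test above: a run 5 chars past the limit, then " abc".
        final String input = new String(new char[tokenizer.getMaxTokenLength() + 5]).replace("\0", "a") + " abc";
        tokenizer.setReader(new StringReader(input));

        // Standard consumption pattern: add the attribute, reset, iterate, end, close.
        final CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            System.out.println(term.length() + " chars: " + term);
        }
        tokenizer.end();
        tokenizer.close();
    }
}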

It seems StandardTokenizer treats a completely filled buffer as a successfully extracted
token (1), and it also emits the tail of a too-long token as a separate token (2). Maybe (1)
is disputable (I think it is a bug), but (2) is clearly a bug.
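
A possible workaround (just a sketch, not verified beyond this case): raise the tokenizer's
limit so an overlong run comes out as a single token, then drop whole long tokens with
LengthFilter instead of letting their fragments into the index. The helper name and the
4096 bound below are my assumptions; a run longer than 4096 chars would still be chunked:

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

import java.io.Reader;

public class LongTokenDroppingStreamSketch
{
    // Hypothetical helper: builds a stream that discards whole overlong tokens.
    public static TokenStream buildStream(Reader reader)
    {
        final StandardTokenizer tokenizer = new StandardTokenizer();
        // Raise the limit so an overlong run is emitted as one token
        // rather than in 255-char chunks. 4096 is an assumed upper bound.
        tokenizer.setMaxTokenLength(4096);
        tokenizer.setReader(reader);
        // Drop any token longer than the default 255 chars as a whole, tail included.
        return new LengthFilter(tokenizer, 1, 255);
    }
}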

Best regards,
Alexey Makeev
makeev_1c@mail.ru