lucene-java-user mailing list archives

From: Steve Rowe <sar...@gmail.com>
Subject: Re: Too long token is not handled properly?
Date: Fri, 11 Nov 2016 15:06:38 GMT
Hi Alexey,

The behavior you mention is an intentional change from the behavior in Lucene 4.9.0 and earlier,
when tokens longer than maxTokenLength were silently ignored: see LUCENE-5897[1] and LUCENE-5400[2].

The new behavior is as follows: token matching rules are no longer allowed to match against
input char sequences longer than maxTokenLength.  If a rule would match a sequence longer
than maxTokenLength, but also matches at maxTokenLength chars or fewer, has the highest
priority among the rules matching at that length, and no other rule matches more chars,
then a token is emitted for that rule at the matching length.  Rule-matching iteration then
simply continues from that point as normal.  If the same rule matches against the remainder
of the sequence that the first match would have consumed had maxTokenLength been longer,
another token of the matched length is emitted, and so on.

Note that this can result in effectively splitting the sequence at maxTokenLength intervals,
as you noted.
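
In terms of your test below, the assertion that would actually pass under the current behavior
(matching the actual contents you observed, and assuming the default maxTokenLength of 255) is
roughly this - an untested sketch:

    assertTokenStreamContents(tokenizer, new String[]{
        new String(new char[255]).replace("\0", "a"),  // first 255 chars of the long run
        "aaaaa",                                       // the 5-char remainder of the same run
        "abc"
    });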

You can fix the problem by setting maxTokenLength higher: this grows the buffer, so the
unwanted token splitting no longer happens.  If that then leaves you with tokens larger than
you would like, you can remove them with LengthFilter.
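
For example, something along these lines (an untested sketch; the 10000 and 255 values are
just illustrative, and 'text' stands in for whatever input you are analyzing):

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.miscellaneous.LengthFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import java.io.StringReader;

    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setMaxTokenLength(10000);           // grow the buffer so long runs are not split
    tokenizer.setReader(new StringReader(text));
    // drop anything longer than the length you actually want indexed:
    TokenStream stream = new LengthFilter(tokenizer, 1, 255);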

FYI there is discussion on LUCENE-5897 about separating buffer size from maxTokenLength, starting
here: <https://issues.apache.org/jira/browse/LUCENE-5897?focusedCommentId=14105729&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14105729>
- ultimately I decided that few people would benefit from the increased configuration complexity.

[1] https://issues.apache.org/jira/browse/LUCENE-5897
[2] https://issues.apache.org/jira/browse/LUCENE-5400

--
Steve
www.lucidworks.com

> On Nov 11, 2016, at 6:23 AM, Alexey Makeev <makeev_1c@mail.ru.INVALID> wrote:
> 
> Hello,
> 
> I'm using lucene 6.2.0 and expecting the following test to pass:
> 
> import org.apache.lucene.analysis.BaseTokenStreamTestCase;
> import org.apache.lucene.analysis.standard.StandardTokenizer;
> 
> import java.io.IOException;
> import java.io.StringReader;
> 
> public class TestStandardTokenizer extends BaseTokenStreamTestCase
> {
>     public void testLongToken() throws IOException
>     {
>         final StandardTokenizer tokenizer = new StandardTokenizer();
>         final int maxTokenLength = tokenizer.getMaxTokenLength();
> 
>         // string with the following contents: a...maxTokenLength+5 times...a abc
>         final String longToken = new String(new char[maxTokenLength + 5]).replace("\0", "a") + " abc";
> 
>         tokenizer.setReader(new StringReader(longToken));
>         
>         assertTokenStreamContents(tokenizer, new String[]{"abc"});
>         // actual contents: "a" 255 times, "aaaaa", "abc"
>     }
> }
> 
> It seems like StandardTokenizer considers a completely filled buffer to be a successfully
> extracted token (1), and also emits the tail of a too-long token as a separate token (2).
> Maybe (1) is disputable (I think it is a bug), but I think (2) is a bug.
> 
> Best regards,
> Alexey Makeev
> makeev_1c@mail.ru



