lucene-java-user mailing list archives

From "Jack Krupansky" <>
Subject Re: WhitespaceTokenizer, incrementToken() ArrayIndexOutOfBoundsException
Date Tue, 16 Apr 2013 00:02:19 GMT
Yes, reset was always "mandatory" in the API-contract sense, but in 3.x it was 
not always enforced in a practical sense (violations had no uniformly severe 
consequences), as the original emailer indicated. Now it is "mandatory" in the 
practical sense as well: every contract violation has extremely annoying 
consequences. So, I should have said that the contract was mandatory but not 
enforced... which from a practical perspective negates its mandatory 
contractual value.

-- Jack Krupansky

-----Original Message----- 
From: Uwe Schindler
Sent: Monday, April 15, 2013 11:53 AM
Subject: RE: WhitespaceTokenizer, incrementToken() ArrayIndexOutOfBoundsException


It was always mandatory! In Lucene 2.x/3.x some Tokenizers just returned 
bogus, undefined stuff if not correctly reset before use, especially when 
Tokenizers are "reused" by the Analyzer, which is now mandatory in 4.x. So in 
Lucene 4.0 we made it throw an exception (NPE or AIOOBE) by initializing the 
state fields in the constructor with default values that cause the exception. 
The exception is not more specific for performance reasons (it is just a side 
effect of the new default values set in the constructor).

Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
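
[Editor's note: the mechanism Uwe describes can be sketched with a toy, 
self-contained example. This is NOT Lucene source; the class and method names 
below are made up for illustration. The point it demonstrates: the constructor 
leaves a state field at an invalid default (-1), so forgetting reset() trips an 
ArrayIndexOutOfBoundsException on the first array access, with no extra runtime 
check in incrementToken() itself.]

```java
import java.util.ArrayList;
import java.util.List;

// Toy whitespace tokenizer mimicking the Lucene 4.x contract: the ctor
// leaves the cursor at an invalid default (-1), so incrementToken() before
// reset() fails with AIOOBE, mirroring the Character.codePointAt failure
// in the real stack trace.
class ToyWhitespaceTokenizer {
    private final char[] input;
    private int pos = -1;   // invalid default set in ctor: forces reset() first
    private String term;

    ToyWhitespaceTokenizer(String text) {
        this.input = text.toCharArray();
    }

    // Mandatory before the first incrementToken(), as in Lucene 4.x.
    void reset() {
        pos = 0;
    }

    boolean incrementToken() {
        // With pos still at -1, input[pos] throws AIOOBE right here.
        while (pos < input.length && Character.isWhitespace(input[pos])) pos++;
        if (pos >= input.length) return false;
        int start = pos;
        while (pos < input.length && !Character.isWhitespace(input[pos])) pos++;
        term = new String(input, start, pos - start);
        return true;
    }

    String term() {
        return term;
    }
}

public class ResetContractDemo {
    public static void main(String[] args) {
        ToyWhitespaceTokenizer ts = new ToyWhitespaceTokenizer("this is a test");
        try {
            ts.incrementToken();   // forgot reset(): blows up, as in 4.x
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("AIOOBE without reset()");
        }
        ts.reset();                // the mandatory call
        List<String> tokens = new ArrayList<>();
        while (ts.incrementToken()) {
            tokens.add(ts.term());
        }
        System.out.println(tokens);   // [this, is, a, test]
    }
}
```

The design trade-off Uwe mentions is visible here: the "enforcement" costs 
nothing on the happy path, because it is just the natural consequence of an 
invalid initial field value rather than an explicit if-check on every call.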

> -----Original Message-----
> From: Jack Krupansky []
> Sent: Monday, April 15, 2013 4:25 PM
> To:
> Subject: Re: WhitespaceTokenizer, incrementToken()
> ArrayIndexOutOfBoundsException
> I didn't read your code, but do you have the "reset" call that is now
> mandatory and whose absence throws AIOOBE?
> -- Jack Krupansky
> -----Original Message-----
> From: andi rexha
> Sent: Monday, April 15, 2013 10:21 AM
> To:
> Subject: WhitespaceTokenizer, incrementToken() ArrayIndexOutOfBoundsException
> Hi,
> I have tried to get all the tokens from a TokenStream in the same way as I
> was doing in the 3.x version of Lucene, but now (at least with
> WhitespaceTokenizer) I get an exception:
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
>     at java.lang.Character.codePointAtImpl(Character.java)
>     at java.lang.Character.codePointAt(Character.java)
>     at org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.codePointAt(CharacterUtils.java)
>     at org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokenizer.java)
> The code is quite simple, and I thought it would work, but obviously it
> doesn't (unless I have made some mistake).
> Here is the code, in case you spot a bug in it (although it is trivial):
> String str = "this is a test";
> Reader reader = new StringReader(str);
> TokenStream tokenStream = new WhitespaceTokenizer(Version.LUCENE_42, reader);
>         //tokenStreamAnalyzer.tokenStream("test", reader);
> CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class);
> while (tokenStream.incrementToken()) {
>     System.out.println(new String(attribute.buffer(), 0, attribute.length()));
> }
> Hope you have any idea of why it is happening.
> Regards,
> Andi
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

