lucene-java-user mailing list archives

From "Uwe Schindler" <>
Subject RE: 3.1 upgrade problem
Date Fri, 01 Apr 2011 08:42:20 GMT
Hi Wouter,

See point 8 in the backwards compatibility section of CHANGES.txt. The reason
is explained in several issues (not all listed there); one example is the
Unicode 4 changes, where a non-final WhitespaceTokenizer would need to do
reflection-based backwards hacks (like in 2.9 when we changed to

The Tokenization API in Lucene is decorator based, so to extend/modify it,
you can:
a) Use one of the abstract base classes
b) Wrap a TokenFilter on top that makes the changes you need
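
To illustrate option b), here is a minimal sketch of the decorator idea. It is a stand-in only and does not use the Lucene classes: TokenSource, WhitespaceSource and LowerCaseDecorator are hypothetical names made up for this example, and a real TokenFilter works on attributes rather than plain strings. The point is just that the splitting class stays final and the extra behaviour lives in a wrapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in sketch only: none of these types are Lucene classes.
final class DecoratorSketch {

    /** Hypothetical token stream: returns the next token, or null when exhausted. */
    interface TokenSource {
        String next();
    }

    /** Final on purpose: splits on whitespace and is not meant to be subclassed. */
    static final class WhitespaceSource implements TokenSource {
        private final String[] parts;
        private int pos = 0;

        WhitespaceSource(String text) {
            String trimmed = text.trim();
            this.parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        @Override
        public String next() {
            return pos < parts.length ? parts[pos++] : null;
        }
    }

    /** The decorator: wraps any TokenSource and modifies the tokens passing through. */
    static final class LowerCaseDecorator implements TokenSource {
        private final TokenSource input;

        LowerCaseDecorator(TokenSource input) {
            this.input = input;
        }

        @Override
        public String next() {
            String token = input.next();
            return token == null ? null : token.toLowerCase(Locale.ROOT);
        }
    }

    /** Pulls every token out of a source into a list. */
    static List<String> drain(TokenSource source) {
        List<String> out = new ArrayList<>();
        for (String t = source.next(); t != null; t = source.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        TokenSource chain = new LowerCaseDecorator(new WhitespaceSource("Hello Lucene World"));
        System.out.println(drain(chain));
    }
}
```

Because the wrapper only depends on the interface, it can sit on top of any source without touching the final class underneath.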

For your case, the simplest thing seems to be to extend not
WhitespaceTokenizer but its base class CharTokenizer (which is
abstract). There you just override isTokenChar(int) and/or normalize(int).
These methods are very simple in WhitespaceTokenizer, so just add your
rules there. Extending WhitespaceTokenizer is not needed and leads to
problems, as mentioned above. Lucene 3.1's test suite now checks that
*every* TokenStream component is final (when assertions are enabled in your
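
As a rough sketch of the CharTokenizer approach, the example below models the pattern without depending on Lucene. CharSplitter is a hypothetical stand-in for the abstract base class, and UnderscoreAwareSplitter with its two rules (treat '_' as a separator, lowercase everything) is an invented example, not Wouter's actual requirement. In real code the subclass would extend Lucene's CharTokenizer instead, and it should be declared final.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in sketch only: CharSplitter imitates the shape of an abstract
// character-based tokenizer; it is not the Lucene CharTokenizer class.
final class CharTokenizerSketch {

    abstract static class CharSplitter {
        /** True if the code point belongs inside a token; anything else separates tokens. */
        protected abstract boolean isTokenChar(int c);

        /** Hook to rewrite each code point before it is added to a token; identity by default. */
        protected int normalize(int c) {
            return c;
        }

        /** The fixed splitting loop; subclasses only supply the two hooks above. */
        final List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            for (int i = 0; i < text.length(); ) {
                int c = text.codePointAt(i);
                i += Character.charCount(c);
                if (isTokenChar(c)) {
                    current.appendCodePoint(normalize(c));
                } else if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            }
            if (current.length() > 0) {
                tokens.add(current.toString());
            }
            return tokens;
        }
    }

    /** Whitespace splitting plus two example rules, declared final like the 3.1 components. */
    static final class UnderscoreAwareSplitter extends CharSplitter {
        @Override
        protected boolean isTokenChar(int c) {
            // Whitespace rule plus one custom rule: '_' also ends a token.
            return !Character.isWhitespace(c) && c != '_';
        }

        @Override
        protected int normalize(int c) {
            return Character.toLowerCase(c);
        }
    }

    public static void main(String[] args) {
        System.out.println(new UnderscoreAwareSplitter().tokenize("Foo_Bar  baz"));
    }
}
```

The custom rules live entirely in the two overridden methods, so nothing here requires subclassing a concrete tokenizer.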


Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen

> -----Original Message-----
> From: Wouter Heijke []
> Sent: Friday, April 01, 2011 9:40 AM
> To:
> Subject: 3.1 upgrade problem
> I'm doing the upgrade to Lucene 3.1.0.
> The upgrade failed on WhitespaceTokenizer being final in this version.
> I don't understand why anyone would make this tokenizer final, I was
> happily extending it for many Lucene versions!
> Wouter
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail: