lucene-java-user mailing list archives

From Steve Rowe <>
Subject Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)
Date Wed, 01 Oct 2014 17:42:53 GMT

Boilerplate upgrade recommendation: consider using the most recent Lucene release (4.10.1)
- it’s the most stable, performant, and featureful release available, and many bugs have
been fixed since the 4.1 release.

FYI, for Chinese, Japanese, Korean, Thai, and other languages that don’t use whitespace to
denote word boundaries, StandardTokenizer finds no word boundaries except those around
punctuation.  Note that Lucene 4.1 does have specialized tokenizers for Simplified Chinese
and Japanese: the smartcn and kuromoji analysis modules, respectively.
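
For example, kuromoji’s JapaneseTokenizer segments Japanese with a dictionary rather than
whitespace, and its constructor takes a flag controlling whether punctuation is kept or
discarded. Here’s a minimal sketch against the 4.1 kuromoji API (the class name and sample
text are mine, just for illustration):

    import java.io.StringReader;
    import org.apache.lucene.analysis.ja.JapaneseTokenizer;
    import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class KuromojiDemo {
      public static void main(String[] args) throws Exception {
        // kuromoji finds word boundaries in unspaced Japanese text;
        // the third argument is discardPunctuation (false = keep it)
        JapaneseTokenizer tok = new JapaneseTokenizer(
            new StringReader("今日は良い天気です。"),
            null,        // no user dictionary
            false,       // keep punctuation as tokens
            Mode.SEARCH);
        CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
        tok.reset();
        while (tok.incrementToken()) {
          System.out.println(term.toString());
        }
        tok.end();
        tok.close();
      }
    }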

It is possible to construct a tokenizer based on pure Java code rather than a JFlex grammar -
there are several examples of this in Lucene 4.1; see e.g. PatternTokenizer, or CharTokenizer
and its subclasses WhitespaceTokenizer and LetterTokenizer.
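
To give a sense of how far PatternTokenizer alone can go - and this also speaks to Dawid’s
question below about keeping punctuation - here’s a throwaway example against the 4.1
analyzers-common API (the regex and class name are mine, not anything that ships with Lucene):

    import java.io.StringReader;
    import java.util.regex.Pattern;
    import org.apache.lucene.analysis.pattern.PatternTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class PunctuationDemo {
      public static void main(String[] args) throws Exception {
        // group 0 means every match of the pattern becomes a token, so runs
        // of word characters and single punctuation marks both come through
        Pattern p = Pattern.compile("\\w+|\\p{Punct}");
        PatternTokenizer tok =
            new PatternTokenizer(new StringReader("Hello, world!"), p, 0);
        CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
        tok.reset();
        while (tok.incrementToken()) {
          System.out.println(term.toString());  // Hello / , / world / !
        }
        tok.end();
        tok.close();
      }
    }

Note that this emits everything with the default token type; if you want punctuation flagged
with a distinct type, you’d still need a custom tokenizer or a token filter that sets the
TypeAttribute. The CharTokenizer route is even shorter - override isTokenChar() to return
false only for whitespace - but then punctuation stays glued to the adjacent words.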


On Oct 1, 2014, at 4:04 AM, Paul Taylor <> wrote:

> On 01/10/2014 08:08, Dawid Weiss wrote:
>> Hi Steve,
>> I have to admit I also find it frequently useful to include
>> punctuation as tokens (even if it's filtered out by subsequent token
>> filters for indexing, it's useful to have for other NLP tasks). Do
>> you think it'd be possible (read: relatively easy) to create an
>> analyzer (or a modification of the standard one's lexer) so that
>> punctuation is returned as a separate token type?
>> Dawid
>> On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe <> wrote:
>>> Hi Paul,
>>> StandardTokenizer implements the Word Boundaries rules in the Unicode Text Segmentation
>>> Standard Annex UAX#29 - here’s the relevant section for Unicode 6.1.0, which is the version
>>> supported by Lucene 4.1.0: <>.
>>> Only those sequences between boundaries that contain letters and/or digits are
>>> returned as tokens; all other sequences between boundaries are skipped over and not
>>> returned as tokens.
>>> Steve
> Yep, I need punctuation - in fact the only thing I usually want removed is whitespace - yet
> I would like to take advantage of the fact that the new tokenizer can recognise some word
> boundaries that are not based on whitespace (in the case of some non-Western languages). I
> have modified the tokenizer before but found it very difficult to understand. Is it
> possible/advisable to construct a tokenizer based on pure Java code rather than one derived
> from a JFlex definition?
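
To make the skipping behaviour described in the quoted thread concrete, here’s a throwaway
sketch against the Lucene 4.1 API (the class name and sample string are mine):

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class StandardDemo {
      public static void main(String[] args) throws Exception {
        StandardTokenizer tok = new StandardTokenizer(
            Version.LUCENE_41, new StringReader("Hello, world! It's 42."));
        CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
        tok.reset();
        while (tok.incrementToken()) {
          // prints Hello / world / It's / 42 - the punctuation-only
          // sequences between boundaries are skipped entirely
          System.out.println(term.toString());
        }
        tok.end();
        tok.close();
      }
    }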

To unsubscribe, e-mail:
For additional commands, e-mail:
