lucene-java-user mailing list archives

From Michael McCandless <>
Subject Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)
Date Wed, 01 Oct 2014 08:07:11 GMT
I played with this possibility on the extremely experimental branch, which I
haven't gotten back to for a long time...

The changes on that branch add the idea of a "deleted token", by just
setting a new DeletedAttribute marking whether the token is deleted or
not.  Otherwise all other token attributes are visible like normal.
I.e., tokens are deleted the way documents are deleted in Lucene
(marked with a bit but not actually deleted until "later").  E.g.
StopFilter (on that branch) just sets that attribute to true, instead
of removing the token and leaving a hole.
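
To make the "mark, don't remove" idea concrete, here is a minimal sketch in plain Java. It is not the code on the branch and does not use Lucene's attribute API; the names DeletedTokenSketch, Token, and markStopWords are hypothetical, and a simple boolean field stands in for the DeletedAttribute.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names come from Lucene or from the branch.
class DeletedTokenSketch {

    // A token that carries a "deleted" bit alongside its other attributes.
    static final class Token {
        final String term;
        final boolean deleted;

        Token(String term, boolean deleted) {
            this.term = term;
            this.deleted = deleted;
        }
    }

    // Marks stop words as deleted instead of dropping them, so the stream
    // keeps its length and later stages can still inspect every token.
    static List<Token> markStopWords(List<String> terms, Set<String> stopWords) {
        List<Token> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Token(term, stopWords.contains(term)));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        List<Token> tokens = markStopWords(Arrays.asList("the", "quick", "fox"), stopWords);
        for (Token t : tokens) {
            System.out.println(t.term + (t.deleted ? " [deleted]" : ""));
        }
    }
}
```

The stop word stays in the list with its flag set, so nothing downstream has to reason about a hole in the stream.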

The branch also had an InsertDeletedPunctuationTokenStage that would
detect when the tokenizer had dropped punctuation and then insert
[deleted] punctuation tokens.

This way IndexWriter could still ignore such tokens (since they are
marked as deleted), but other token filters would still see the
deleted tokens and be able to make decisions based on them...
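
A rough sketch of how these two pieces could fit together, again in plain Java rather than the branch's actual stage API. It assumes the tokenizer reports start and end offsets, treats any non-whitespace text in the gaps between tokens as dropped punctuation, and uses hypothetical names throughout (DeletedPunctuationSketch, insertDeletedPunctuation, indexableTerms).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: not Lucene code and not the stage from the branch.
class DeletedPunctuationSketch {

    static final class Token {
        final String term;
        final int start;       // inclusive offset into the original text
        final int end;         // exclusive offset into the original text
        final boolean deleted;

        Token(String term, int start, int end, boolean deleted) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.deleted = deleted;
        }
    }

    // Re-inserts text the tokenizer skipped, as tokens marked deleted.
    static List<Token> insertDeletedPunctuation(String text, List<Token> tokens) {
        List<Token> out = new ArrayList<>();
        int pos = 0;
        for (Token t : tokens) {
            addGap(text, pos, t.start, out);
            out.add(t);
            pos = t.end;
        }
        addGap(text, pos, text.length(), out);
        return out;
    }

    // Emits one deleted token per run of non-whitespace characters in [from, to).
    private static void addGap(String text, int from, int to, List<Token> out) {
        int i = from;
        while (i < to) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int j = i;
            while (j < to && !Character.isWhitespace(text.charAt(j))) {
                j++;
            }
            out.add(new Token(text.substring(i, j), i, j, true));
            i = j;
        }
    }

    // Stand-in for the indexing step: deleted tokens are skipped.
    static List<String> indexableTerms(List<Token> tokens) {
        List<String> terms = new ArrayList<>();
        for (Token t : tokens) {
            if (!t.deleted) {
                terms.add(t.term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String text = "Hello, world!!!";
        List<Token> fromTokenizer = Arrays.asList(
            new Token("Hello", 0, 5, false),
            new Token("world", 7, 12, false));
        List<Token> all = insertDeletedPunctuation(text, fromTokenizer);
        for (Token t : all) {
            System.out.println(t.term + (t.deleted ? " [deleted]" : ""));
        }
        System.out.println("indexed: " + indexableTerms(all));
    }
}
```

A filter running after the insertion step sees all four tokens, including "," and "!!!", while the indexing stand-in only receives "Hello" and "world".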

Anyway, the branch is far far away from committing, but maybe we could
just pull off of it the idea of a "deleted bit" that we mark on a
given Token to tell IndexWriter not to index it, but subsequent token
filters would be able to see it ...

Mike McCandless

On Wed, Oct 1, 2014 at 3:08 AM, Dawid Weiss <> wrote:
> Hi Steve,
> I have to admit I also find it frequently useful to include
> punctuation as tokens (even if it's filtered out by subsequent token
> filters for indexing, it's a useful to-have for other NLP tasks). Do
> you think it'd be possible (read: relatively easy) to create an
> analyzer (or a modification of the standard one's lexer) so that
> punctuation is returned as a separate token type?
> Dawid
> On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe <> wrote:
>> Hi Paul,
>> StandardTokenizer implements the Word Boundaries rules in the Unicode Text Segmentation
>> Standard Annex UAX#29 - here’s the relevant section for Unicode 6.1.0, which is the version
>> supported by Lucene 4.1.0: <>.
>> Only those sequences between boundaries that contain letters and/or digits are returned
>> as tokens; all other sequences between boundaries are skipped over and not returned as tokens.
>> Steve
>> On Sep 30, 2014, at 3:54 PM, Paul Taylor <> wrote:
>>> Does StandardTokenizer remove punctuation (in Lucene 4.1)
>>> I'm just trying to move back to StandardTokenizer from my own old custom implementation
>>> because the newer version seems to have much better support for Asian languages.
>>> However, this code excerpt fails on incrementToken(), implying that the !!! are
>>> removed from the output, yet looking at the jflex classes I can't see anything to indicate
>>> punctuation is removed. Is it removed, and if so can I remove it?
>>> Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, new
>>> assertNotNull(tokenizer);
>>> tokenizer.reset();
>>> assertTrue(tokenizer.incrementToken());