lucene-dev mailing list archives

From "DM Smith (JIRA)" <>
Subject [jira] Commented: (LUCENE-1333) Token implementation needs improvements
Date Wed, 13 Aug 2008 12:54:46 GMT


DM Smith commented on LUCENE-1333:

Cloning Tokens is not cheap, as I recall. In fact, I seem to recall testing that creating a
new Token was cheaper. Maybe that is fixed in this issue, but I am not sure.

I was going on hearsay when I uniformly used clone() rather than new when creating a deep
copy of an existing token. I was under the impression that clone was faster than new for
equivalent work.

The test is rather simple and worthy of doing before accepting this issue. I don't think I
have time to do it today.
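That test could be sketched roughly as follows. This is a micro-benchmark against a simplified stand-in class (SimpleToken, with only a few of the real Token's fields), not Lucene's Token itself, so the absolute numbers are only indicative:

```java
// Micro-benchmark sketch: clone() vs. a copying constructor.
// SimpleToken is a simplified stand-in for Lucene's Token, not the real class.
public class CloneVsNew {
    static class SimpleToken implements Cloneable {
        char[] termBuffer;
        int startOffset, endOffset, flags;

        SimpleToken(char[] buf, int start, int end, int flags) {
            this.termBuffer = buf.clone();   // deep copy of the buffer
            this.startOffset = start;
            this.endOffset = end;
            this.flags = flags;
        }

        // Copying constructor: the "new" path being timed.
        SimpleToken(SimpleToken other) {
            this(other.termBuffer, other.startOffset, other.endOffset, other.flags);
        }

        @Override
        public SimpleToken clone() {
            try {
                SimpleToken t = (SimpleToken) super.clone();
                t.termBuffer = termBuffer.clone();  // clone() must still deep-copy the buffer
                return t;
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }
    }

    static long time(Runnable r, int iters) {
        long t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) r.run();
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        SimpleToken proto = new SimpleToken("example".toCharArray(), 0, 7, 0);
        int n = 1_000_000;
        long viaClone = time(proto::clone, n);
        long viaNew = time(() -> new SimpleToken(proto), n);
        System.out.println("clone: " + viaClone + " ns, new: " + viaNew + " ns");
    }
}
```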

The equivalent of clone is (done from memory, so this is close):
Token token = new Token(oldToken.startOffset(), oldToken.endOffset(), oldToken.getFlags());
token.setTermBuffer(oldToken.termBuffer(), 0, oldToken.termLength());
token.setType(oldToken.type());
token.setPositionIncrement(oldToken.getPositionIncrement());
if (oldToken.getPayload() != null) {
  // Create a new Payload holding a deep copy of the payload bytes
  token.setPayload((Payload) oldToken.getPayload().clone());
}

While this might be faster, it has two flaws that clone avoids: clone has direct access to
the fields and so avoids method calls, and it is future proof. If a new field is added to
Token, it will automatically be carried forward.

There are a couple of places in the code where adding:
public Token(Token prototype) // only if new is faster
public void copyFrom(Token prototype)
would solve these maintenance issues.
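A sketch of what those two helpers could look like, again on a simplified stand-in (the field names here are illustrative, not the real Lucene internals):

```java
// TokenSketch is a simplified stand-in for Lucene's Token, used only to
// illustrate the proposed copying constructor and copyFrom helper.
public class TokenSketch {
    char[] termBuffer = new char[0];
    int termLength, startOffset, endOffset, flags;
    String type = "word";

    public TokenSketch() {}

    // Copying constructor: "only if new is faster" than clone().
    public TokenSketch(TokenSketch prototype) {
        copyFrom(prototype);
    }

    // copyFrom has direct access to the fields, avoids accessor calls, and
    // keeps the maintenance burden in one place: a newly added field only
    // needs to be copied here.
    public void copyFrom(TokenSketch prototype) {
        termBuffer = new char[prototype.termLength];
        System.arraycopy(prototype.termBuffer, 0, termBuffer, 0, prototype.termLength);
        termLength = prototype.termLength;
        startOffset = prototype.startOffset;
        endOffset = prototype.endOffset;
        flags = prototype.flags;
        type = prototype.type;
    }
}
```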

But how do you cope with reset()?
This problem is a bug in the existing code. Today, one can create a chain of TokenFilters,
each of which calls next() or next(Token), and any one of which may modify the returned
token; it does not matter which method is invoked. If the token returned is held in a cache,
then the cache is corrupted. Every cache of Tokens needs to ensure that its contents are
immutable on creation. It also needs to ensure that they are immutable on usage if the
tokens can be served more than once.

Two personal opinions:
* Caches that don't implement reset should return cache.remove(0) [or equivalent] so it is
clear that the cache can only be used once.
* Caches should not be used except when it gives a clear performance advantage.
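A cache that satisfies both immutability requirements could look like this sketch (CachedTokens and Tok are illustrative names, not Lucene classes; Tok stands in for Token):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a cache that stays uncorrupted: deep-copy on insert and again on
// every serve, so downstream filters may freely mutate what they receive.
public class CachedTokens {
    static class Tok {
        char[] buf;
        Tok(char[] buf) { this.buf = buf.clone(); }   // defensive copy
        @Override public Tok clone() { return new Tok(buf); }
    }

    private final List<Tok> cache = new ArrayList<>();
    private int pos = 0;

    // Immutable on creation: store a copy, not the caller's instance.
    void add(Tok t) { cache.add(t.clone()); }

    // Immutable on usage: hand out a copy, never the cached instance,
    // because tokens may be served more than once after reset().
    Tok next() {
        if (pos >= cache.size()) return null;
        return cache.get(pos++).clone();
    }

    void reset() { pos = 0; }
}
```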

> Token implementation needs improvements
> ---------------------------------------
>                 Key: LUCENE-1333
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch,
LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch,
LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch,
LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch,
LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch,
> This was discussed in the thread (not sure which place is best to reference so here are
> or to see it all at once:
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and
the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can
be over-allocated. A less aggressive algorithm would be better. In the thread, the Python
example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded
now rather than waiting for 3.0. As it stands, filter chains that alternate between char[]
and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in
contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token); next() should be deprecated so that reuse
is maximized, and descendant classes should be rewritten to override next(Token).
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors
that took a Token. This would simplify the use of the two together.
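On item 3, the less aggressive allocation policy referenced from the thread can be sketched after CPython's list over-allocation, which grows by roughly one eighth plus a small constant rather than doubling (the constants below follow CPython but are illustrative here):

```java
// Sketch of a less aggressive char[] growth policy, modeled on CPython's
// list over-allocation: grow by ~1/8 of the requested size plus a small
// constant, instead of doubling.
public class Growth {
    static int grow(int needed) {
        // over-allocate proportionally to the requested size
        return needed + (needed >> 3) + (needed < 9 ? 3 : 6);
    }

    static char[] resize(char[] buf, int needed) {
        if (buf.length >= needed) return buf;
        char[] bigger = new char[grow(needed)];
        System.arraycopy(buf, 0, bigger, 0, buf.length);
        return bigger;
    }
}
```

For a large token the over-allocation stays near 12.5% instead of up to 100%, which is the point of item 3.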

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
