lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless" <>
Subject Re: Token implementation
Date Mon, 19 May 2008 09:44:30 GMT
I agree the situation is not ideal, and it's confusing.

This comes back to LUCENE-969.

At the time, we decided to keep both String & char[] only to avoid
performance cost for those analyzer chains that use String tokens

The idea was to allow Token to keep both text or char[] and sometimes
both (if they are storing the same characters, as happens if
termBuffer() is called when it's a String being stored)

Then, in 3.0, we would make the change you are proposing (to only
store char[] internally).  That was the plan, anyway.  Accelerating
this plan (to store only char[] today) is compelling, but I worry
about the performance hit to legacy analyzer chains...

More responses below:

DM Smith <> wrote:
> I think the Token implementation as it stands can use some improvement and
> I'd be willing to do it. I'd like some input, though. Especially because it
> is core to Lucene.
> I've been working on eliminating deprecations from my user code and I ran
> across Token.getText() as being deprecated.
> This is not about my code, but the code in Token.
> In Token, it allows one of two representations of the actual token, either
> an immutable String, or a mutable char[]. One can flip back and forth
> between these all to easily.
> termText() is deprecated so termBuffer() is suggested as a replacement.
> Calling termBuffer() will potentially copy the text out of the String
> termText and into the newly created buffer and return it.
> Calling setTermText(str), which is not deprecated, will drop the buffer and
> save the str in termText.
> It appears that the invariant that is trying to be established is
> either termText or termBuffer holds the content, but not both.
> However, termBuffer() potentially violates this by loading termText
> with the termBuffer, but not nulling out the buffer.

Actually, both are allowed to be set, as long as they are the same.
So termBuffer() is allowed to leave both non-null.

> Also, in my code, I am not manipulating char[] so getting the buffer, I need
> to immediately convert it to a string to process it. And then when I'm done,
> I have a String possibly of some other length. To stuff it back into the
> termBuffer, requires a call to:
> setTermBuffer(s.toCharArray(), o, s.length())

It would be better to call Token.resizeTermBuffer(...), then
s.getChars into the Token's term buffer (saves a buffer copy).

> I was looking at this in light of TokenFilter's next(Token) method and how
> it was being used. In looking at the contrib filters, they have not been
> modified. Further, most of them, if they work with the content analysis and
> generation, do their work in strings. Some of these appear to be good
> candidates for using char[] rather than strings, such as the NGram filter.
> But others look like they'd just as well remain with String manipulation.

It would be great to upgrade all contrib filters to use the re-use APIs.

> I'd like to suggest that internally, that Token be changed to only use
> char[] termBuffer and eliminate termText.

The question is what performance cost we are incurring eg on the
contrib (& other) sources/filters?  Every time setTermText is called,
we copy out the chars (instead of holding a reference to the String).
Every time getText() is called we create a new String(...) from the
char[].  I think it's potentially a high cost, and so maybe we should
wait until 3.0 when we drop the deprecated APIs?

> And also, that termText be restored as not deprecated.

It made me nervous keeping this method because it looks like it should
be cheap to call, and in the past it was very cheap to call.  But,
maybe we could keep it, if we mark very very clearly in the javadocs
the performance cost you incur by using this method (it makes a new
String() every time)?

> But, in TokenFilter, next() should be deprecated, IMHO.

I think this is a good idea.  After all if people don't want to bother
using the passed in Token, they are still allowed to return a new

> I have also used a better algorithm than doubling for resizing an
> array. I'd have to hunt for it.

That would be great!


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message