lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From DM Smith <>
Subject Re: Token implementation
Date Mon, 19 May 2008 15:19:24 GMT
Michael McCandless wrote:
> I agree the situation is not ideal, and it's confusing.
My problem as a user is that I have to read the code to figure out how 
to optimally use the class. The JavaDoc is a bit wanting.
> This comes back to LUCENE-969.
> At the time, we decided to keep both String & char[] only to avoid
> performance cost for those analyzer chains that use String tokens
> exclusively.
> The idea was to allow Token to keep both text or char[] and sometimes
> both (if they are storing the same characters, as happens if
> termBuffer() is called when it's a String being stored)
When termBuffer() is called termText is always null on return. This is 
the invariant of initTermBuffer() which is called directly or indirectly 
by every method that touches termBuffer.

It is only after calling termText() that one could have both. The only 
advantage I see here is that calling it twice without any intervening 
call to a termBuffer method would it return the same physical string.

After calling setTermText(newText), termBuffer is set to null.

I presume the purpose of a filter is to get the token content, and if 
modified, set the new content. If so, the result of the setXXX will be 
that either termText or termBuffer will be null.

> Then, in 3.0, we would make the change you are proposing (to only
> store char[] internally).  That was the plan, anyway.  Accelerating
> this plan (to store only char[] today) is compelling, but I worry
> about the performance hit to legacy analyzer chains...
I wonder whether it is all that significant an issue. Today, some of the 
Lucene analyzers have been migrated to using char[], while others, 
notably contrib, continue to use text.

IMHO: Prior to char[], the text was replaceable, but not modifiable. 
 From a practical perspective, Token reuse minimized the cost of 
construction, but not much else. The performance of a Token was 
predictable, but the filter was potentially sub-optimal. With char[] and 
supporting methods, the text became modifiable, too.

When a filter calls setTermText or setTermBuffer, it does not know how 
the consumer of the token will work. It could be that it stores it with 
setTermText and the next filter calls termBuffer().

I may not understand this correctly, but it seems to me that the 
following is plausible given a filter chain of Lucene provided filters 
(including contrib)
If we have a token filter chain of A -> B -> C, which uses next() in any 
part of the chain, the flow of a reusable Token is stopped. A given 
filter may cache a Token and reuse it. So consider the following scenario:
A overrides next(Token) and reuses the token via char[]
B overrides next() and has a cached Token and updates text.
C overrides next(Token) and reuses the token via char[].

First run:
After A is done, the termText in the supplied Token is null and 
termBuffer has a value.

B provides it's own token so it is not influenced by A.

C is given the token that B returns, because the caller is effectively 
using "token =", but because the token has text, a 
performance hit is taken to put it into char[]. Both text and char[] 
start out the same, but because char[] is changed, termText is set to null.

Second run:
A starts with a Token with a char[] because it is reusing the token from 
the last run or because it is using a localToken from the caller. If it 
is a localToken, then the scenario is as above. But if it is the end 
result of the first run, then A is re-using the token that is cached in 
B. Since C last modified it, it is char[].

B uses its cached Token, but it was modified by A to be char[] with null 
text. Now B takes a performance hit as it creates a new String.

C is as in the first run.

Another scenario:
A, B and C are all legacy. This would only be filter chains that are not 
provided by core Lucene as the core filter chains have been migrated. 
This would be a performance hit.

A last scenario:
A, B and C are all char[]. This would not take a performance hit.

It seems to me that in a mixed chain, that there will always be a 
performance hit. But my guess is that maintaining termBuffer once used 
would be a good thing.

So, a modified suggestion to maintain the performance but improve the 
first scenario. Do you see any problem with the following?

If termBuffer is used in a token, then it is maintained and never set to 
Note also that resizeTermBuffer(size) maintains the termBuffer. That is, 
it copies the text when the array is grown. When it is known that it is 
going to be slammed this is unnecessary. One can implement a helper 
function that merely grows the array. There are a couple of places this

Thus setTokenText would be something like:
  public void setTermText(String text) {
    termText = text;
    if (termBuffer != null) {
      termText.getChars(0, termText.length(), termBuffer, 0);

A possible implementation to grow the array would be:
private void growTermBuffer(int newSize)
  // determine the best size
  if (newSize < MIN_BUFFER_SIZE) {
    newSize = MIN_BUFFER_SIZE;

  // if the buffer exists and is too small, then determine a better size.
  // this is the current doubling algorithm. it could be better.
  int tbLength = termBuffer == null ? 0 : termBuffer.length;
  if (tbLength > 0 && newSize > tbLength) {
    int size = tbLength;
    while (size < newSize) {
       size *= 2;
     newSize = size;

  // Check to see if the buffer needs to be resized
  if (newSize > tbLength)
    termBuffer = new char[newSize];

More below....
> More responses below:
> DM Smith <> wrote:
>> I think the Token implementation as it stands can use some improvement and
>> I'd be willing to do it. I'd like some input, though. Especially because it
>> is core to Lucene.
>> I've been working on eliminating deprecations from my user code and I ran
>> across Token.getText() as being deprecated.
>> This is not about my code, but the code in Token.
>> In Token, it allows one of two representations of the actual token, either
>> an immutable String, or a mutable char[]. One can flip back and forth
>> between these all to easily.
>> termText() is deprecated so termBuffer() is suggested as a replacement.
>> Calling termBuffer() will potentially copy the text out of the String
>> termText and into the newly created buffer and return it.
>> Calling setTermText(str), which is not deprecated, will drop the buffer and
>> save the str in termText.
>> It appears that the invariant that is trying to be established is
>> either termText or termBuffer holds the content, but not both.
>> However, termBuffer() potentially violates this by loading termText
>> with the termBuffer, but not nulling out the buffer.
> Actually, both are allowed to be set, as long as they are the same.
> So termBuffer() is allowed to leave both non-null.
I'm not sure of the advantage.

>> Also, in my code, I am not manipulating char[] so getting the buffer, I need
>> to immediately convert it to a string to process it. And then when I'm done,
>> I have a String possibly of some other length. To stuff it back into the
>> termBuffer, requires a call to:
>> setTermBuffer(s.toCharArray(), o, s.length())
> It would be better to call Token.resizeTermBuffer(...), then
> s.getChars into the Token's term buffer (saves a buffer copy).

Many thanks. I saw the comment, but missed getChars. This is better, but 
even better would be growTermBuffer (above) since resizeTermBuffer 
potentially has an unnecessary copy.

>> I was looking at this in light of TokenFilter's next(Token) method and how
>> it was being used. In looking at the contrib filters, they have not been
>> modified. Further, most of them, if they work with the content analysis and
>> generation, do their work in strings. Some of these appear to be good
>> candidates for using char[] rather than strings, such as the NGram filter.
>> But others look like they'd just as well remain with String manipulation.
> It would be great to upgrade all contrib filters to use the re-use APIs.

I'll be happy to work toward that end. I think it affects my performance 

>> I'd like to suggest that internally, that Token be changed to only use
>> char[] termBuffer and eliminate termText.
> The question is what performance cost we are incurring eg on the
> contrib (& other) sources/filters?  Every time setTermText is called,
> we copy out the chars (instead of holding a reference to the String).
> Every time getText() is called we create a new String(...) from the
> char[].  I think it's potentially a high cost, and so maybe we should
> wait until 3.0 when we drop the deprecated APIs?
See above. I'll concede that. But I think that once termBuffer is used 
because of a mixed chain, it should be retained.
>> And also, that termText be restored as not deprecated.
> It made me nervous keeping this method because it looks like it should
> be cheap to call, and in the past it was very cheap to call.  But,
> maybe we could keep it, if we mark very very clearly in the javadocs
> the performance cost you incur by using this method (it makes a new
> String() every time)?
>> But, in TokenFilter, next() should be deprecated, IMHO.
> I think this is a good idea.  After all if people don't want to bother
> using the passed in Token, they are still allowed to return a new
> one.

Should the constructors in Token that take a String be deprecated, too? 
The comments seem to suggest that.

>> I have also used a better algorithm than doubling for resizing an
>> array. I'd have to hunt for it.
> That would be great!

I'm looking, but still haven't found it. :)

I'm looking into digging into this one.

Based on the outcome of this discussion, would this be one Jira issue? 
Reopening LUCENE-969? A separate issue for the contrib changes?

Since this would not be an API change, would there need to be changes to 
test cases? If so, what would you suggest?


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message