lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless" <>
Subject Re: Token implementation
Date Mon, 19 May 2008 20:33:04 GMT
DM Smith <> wrote:
> Michael McCandless wrote:
>> I agree the situation is not ideal, and it's confusing.
> My problem as a user is that I have to read the code to figure out how to
> optimally use the class. The JavaDoc is a bit wanting.

Yeah we should fix the javadocs.

>> This comes back to LUCENE-969.
>> At the time, we decided to keep both String & char[] only to avoid
>> performance cost for those analyzer chains that use String tokens
>> exclusively.
>> The idea was to allow Token to keep both text or char[] and sometimes
>> both (if they are storing the same characters, as happens if
>> termBuffer() is called when it's a String being stored)
> When termBuffer() is called termText is always null on return. This is the
> invariant of initTermBuffer() which is called directly or indirectly by
> every method that touches termBuffer.

Sorry I meant termText().

> It is only after calling termText() that one could have both. The only
> advantage I see here is that calling it twice without any intervening call
> to a termBuffer method would it return the same physical string.


> After calling setTermText(newText), termBuffer is set to null.
> I presume the purpose of a filter is to get the token content, and if
> modified, set the new content. If so, the result of the setXXX will be that
> either termText or termBuffer will be null.

Right, though, if there's no change, and the next filter in the chain
calls termText(), we save constructing a new String by caching it.

>> Then, in 3.0, we would make the change you are proposing (to only
>> store char[] internally).  That was the plan, anyway.  Accelerating
>> this plan (to store only char[] today) is compelling, but I worry
>> about the performance hit to legacy analyzer chains...
> I wonder whether it is all that significant an issue. Today, some of the
> Lucene analyzers have been migrated to using char[], while others, notably
> contrib, continue to use text.
> IMHO: Prior to char[], the text was replaceable, but not modifiable. From a
> practical perspective, Token reuse minimized the cost of construction, but
> not much else. The performance of a Token was predictable, but the filter
> was potentially sub-optimal. With char[] and supporting methods, the text
> became modifiable, too.
> When a filter calls setTermText or setTermBuffer, it does not know how the
> consumer of the token will work. It could be that it stores it with
> setTermText and the next filter calls termBuffer().
> I may not understand this correctly, but it seems to me that the following
> is plausible given a filter chain of Lucene provided filters (including contrib)
> If we have a token filter chain of A -> B -> C, which uses next() in any
> part of the chain, the flow of a reusable Token is stopped. A given filter
> may cache a Token and reuse it. So consider the following scenario:
> A overrides next(Token) and reuses the token via char[]
> B overrides next() and has a cached Token and updates text.
> C overrides next(Token) and reuses the token via char[].
> First run:
> After A is done, the termText in the supplied Token is null and termBuffer
> has a value.
> B provides it's own token so it is not influenced by A.
> C is given the token that B returns, because the caller is effectively using
> "token =", but because the token has text, a performance
> hit is taken to put it into char[]. Both text and char[] start out the same,
> but because char[] is changed, termText is set to null.
> Second run:
> A starts with a Token with a char[] because it is reusing the token from the
> last run or because it is using a localToken from the caller. If it is a
> localToken, then the scenario is as above. But if it is the end result of
> the first run, then A is re-using the token that is cached in B. Since C
> last modified it, it is char[].
> B uses its cached Token, but it was modified by A to be char[] with null
> text. Now B takes a performance hit as it creates a new String.
> C is as in the first run.
> Another scenario:
> A, B and C are all legacy. This would only be filter chains that are not
> provided by core Lucene as the core filter chains have been migrated. This
> would be a performance hit.

Why is this a performance hit?  If they are all legacy they would all
implemented the non-reuse next() API, and would use setTermText, and
no conversion to char[] would happen (except inside DocumentsWriter)?

> A last scenario:
> A, B and C are all char[]. This would not take a performance hit.
> It seems to me that in a mixed chain, that there will always be a
> performance hit.

Right, though since we cache the String we should save on multiple
calls to termText().  And in a non-mixed chain (all new or all old) I
think there wouldn't be a hit.

> But my guess is that maintaining termBuffer once used would
> be a good thing.


> So, a modified suggestion to maintain the performance but improve the first
> scenario. Do you see any problem with the following?
> If termBuffer is used in a token, then it is maintained and never set to  null.
> Note also that resizeTermBuffer(size) maintains the termBuffer. That is, it
> copies the text when the array is grown. When it is known that it is going
> to be slammed this is unnecessary. One can implement a helper function that
> merely grows the array. There are a couple of places this
> Thus setTokenText would be something like:
>  public void setTermText(String text) {
>   termText = text;
>   if (termBuffer != null) {
>     growTermBuffer(termText.length());
>     termText.getChars(0, termText.length(), termBuffer, 0);
>   }
>  }
> A possible implementation to grow the array would be:
> private void growTermBuffer(int newSize)
> {
>  // determine the best size
>  if (newSize < MIN_BUFFER_SIZE) {
>   newSize = MIN_BUFFER_SIZE;
>  }
>  // if the buffer exists and is too small, then determine a better size.
>  // this is the current doubling algorithm. it could be better.
>  int tbLength = termBuffer == null ? 0 : termBuffer.length;
>  if (tbLength > 0 && newSize > tbLength) {
>   int size = tbLength;
>   while (size < newSize) {
>      size *= 2;
>   }
>    newSize = size;
>  }
>  // Check to see if the buffer needs to be resized
>  if (newSize > tbLength)
>  {
>   termBuffer = new char[newSize];
>  }
> }

That looks good.  Though, if Token is 100% re-used (full chain is
"new") then I would expect growing the char[] to be very low cost
(happens very few times).

> More below....
>> More responses below:
>> DM Smith <> wrote:
>>> I think the Token implementation as it stands can use some improvement and
>>> I'd be willing to do it. I'd like some input, though. Especially because it
>>> is core to Lucene.
>>> I've been working on eliminating deprecations from my user code and I ran
>>> across Token.getText() as being deprecated.
>>> This is not about my code, but the code in Token.
>>> In Token, it allows one of two representations of the actual token, either
>>> an immutable String, or a mutable char[]. One can flip back and forth
>>> between these all to easily.
>>> termText() is deprecated so termBuffer() is suggested as a replacement.
>>> Calling termBuffer() will potentially copy the text out of the String
>>> termText and into the newly created buffer and return it.
>>> Calling setTermText(str), which is not deprecated, will drop the buffer and
>>> save the str in termText.
>>> It appears that the invariant that is trying to be established is
>>> either termText or termBuffer holds the content, but not both.
>>> However, termBuffer() potentially violates this by loading termText
>>> with the termBuffer, but not nulling out the buffer.
>> Actually, both are allowed to be set, as long as they are the same.
>> So termBuffer() is allowed to leave both non-null.
> I'm not sure of the advantage.

Covered above (mixed chains).

>>> Also, in my code, I am not manipulating char[] so getting the buffer, I need
>>> to immediately convert it to a string to process it. And then when I'm done,
>>> I have a String possibly of some other length. To stuff it back into the
>>> termBuffer, requires a call to:
>>> setTermBuffer(s.toCharArray(), o, s.length())
>> It would be better to call Token.resizeTermBuffer(...), then
>> s.getChars into the Token's term buffer (saves a buffer copy).
> Many thanks. I saw the comment, but missed getChars. This is better, but
> even better would be growTermBuffer (above) since resizeTermBuffer
> potentially has an unnecessary copy.

True, but remember growing should be very rare in 100% reuse (new) case.

>>> I was looking at this in light of TokenFilter's next(Token) method and how
>>> it was being used. In looking at the contrib filters, they have not been
>>> modified. Further, most of them, if they work with the content analysis and
>>> generation, do their work in strings. Some of these appear to be good
>>> candidates for using char[] rather than strings, such as the NGram filter.
>>> But others look like they'd just as well remain with String manipulation.
>> It would be great to upgrade all contrib filters to use the re-use APIs.
> I'll be happy to work toward that end. I think it affects my performance
> today.

That would be great :)

>>> I'd like to suggest that internally, that Token be changed to only use
>>> char[] termBuffer and eliminate termText.
>> The question is what performance cost we are incurring eg on the
>> contrib (& other) sources/filters?  Every time setTermText is called,
>> we copy out the chars (instead of holding a reference to the String).
>> Every time getText() is called we create a new String(...) from the
>> char[].  I think it's potentially a high cost, and so maybe we should
>> wait until 3.0 when we drop the deprecated APIs?
> See above. I'll concede that. But I think that once termBuffer is used
> because of a mixed chain, it should be retained.

This may cause problems if a core Tokenizer is used to produce the
tokens, but then a big chain of old (non-reuse) filters is used after

>>> And also, that termText be restored as not deprecated.
>> It made me nervous keeping this method because it looks like it should
>> be cheap to call, and in the past it was very cheap to call.  But,
>> maybe we could keep it, if we mark very very clearly in the javadocs
>> the performance cost you incur by using this method (it makes a new
>> String() every time)?
>>> But, in TokenFilter, next() should be deprecated, IMHO.
>> I think this is a good idea.  After all if people don't want to bother
>> using the passed in Token, they are still allowed to return a new
>> one.
> Should the constructors in Token that take a String be deprecated, too? The
> comments seem to suggest that.

Actually I think they should.  We should strongly encourage the re-use

>>> I have also used a better algorithm than doubling for resizing an
>>> array. I'd have to hunt for it.
>> That would be great!
> I'm looking, but still haven't found it. :)
> I'm looking into digging into this one.

Python has an interesting approach, for its list type.  Here's a
snippet from listobject.c from Python's sources:

	/* This over-allocates proportional to the list size, making room
	 * for additional growth.  The over-allocation is mild, but is
	 * enough to give linear-time amortized behavior over a long
	 * sequence of appends() in the presence of a poorly-performing
	 * system realloc().
	 * The growth pattern is:  0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
	new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6) + newsize;

I'm sure Perl has something interesting too :)  But I think this is
probably overkill for us ...

> Based on the outcome of this discussion, would this be one Jira issue?
> Reopening LUCENE-969? A separate issue for the contrib changes?

I would say open a new issue?

> Since this would not be an API change, would there need to be changes to
> test cases? If so, what would you suggest?

I think the existing test cases are likely sufficient though I always
love adding new ones ;)


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message