lucene-dev mailing list archives

From Michael McCandless <>
Subject Re: Token implementation
Date Tue, 20 May 2008 09:01:49 GMT

DM Smith wrote:
> On May 19, 2008, at 4:33 PM, Michael McCandless wrote:
>> DM Smith <> wrote:
>>> Michael McCandless wrote:
>>>> I agree the situation is not ideal, and it's confusing.
>>> My problem as a user is that I have to read the code to figure out
>>> how to optimally use the class. The JavaDoc is a bit wanting.
>> Yeah we should fix the javadocs.
>>>> This comes back to LUCENE-969.
>>>> At the time, we decided to keep both String & char[] only to avoid
>>>> performance cost for those analyzer chains that use String tokens
>>>> exclusively.
>>>> The idea was to allow Token to keep either text or char[], and
>>>> sometimes both (if they store the same characters, as happens when
>>>> termBuffer() is called while a String is being stored).
>>> When termBuffer() is called, termText is always null on return. This
>>> is the invariant of initTermBuffer(), which is called directly or
>>> indirectly by every method that touches termBuffer.
>> Sorry I meant termText().
>>> It is only after calling termText() that one could have both. The
>>> only advantage I see here is that calling it twice without any
>>> intervening call to a termBuffer method returns the same physical
>>> string.
>> Right.
>>> After calling setTermText(newText), termBuffer is set to null.
>>> I presume the purpose of a filter is to get the token content and,
>>> if modified, set the new content. If so, the result of the setXXX
>>> will be that either termText or termBuffer will be null.
>> Right, though, if there's no change, and the next filter in the chain
>> calls termText(), we save constructing a new String by caching it.
>>>> Then, in 3.0, we would make the change you are proposing (to only
>>>> store char[] internally).  That was the plan, anyway.  Accelerating
>>>> this plan (to store only char[] today) is compelling, but I worry
>>>> about the performance hit to legacy analyzer chains...
>>> I wonder whether it is all that significant an issue. Today, some of
>>> the Lucene analyzers have been migrated to using char[], while
>>> others, notably contrib, continue to use text.
>>> IMHO: Prior to char[], the text was replaceable, but not modifiable.
>>> From a practical perspective, Token reuse minimized the cost of
>>> construction, but not much else. The performance of a Token was
>>> predictable, but the filter was potentially sub-optimal. With char[]
>>> and supporting methods, the text became modifiable, too.
>>> When a filter calls setTermText or setTermBuffer, it does not know
>>> how the consumer of the token will work. It could be that it stores
>>> it with setTermText and the next filter calls termBuffer().
>>> I may not understand this correctly, but it seems to me that the
>>> following is plausible given a filter chain of Lucene-provided
>>> filters (including contrib).
>>> If we have a token filter chain of A -> B -> C, which uses next() in
>>> any part of the chain, the flow of a reusable Token is stopped. A
>>> given filter may cache a Token and reuse it. So consider the
>>> following scenario:
>>> A overrides next(Token) and reuses the token via char[].
>>> B overrides next() and has a cached Token and updates text.
>>> C overrides next(Token) and reuses the token via char[].
>>> First run:
>>> After A is done, the termText in the supplied Token is null and
>>> termBuffer has a value.
>>> B provides its own token, so it is not influenced by A.
>>> C is given the token that B returns, because the caller is
>>> effectively using "token =", but because the token has text, a
>>> performance hit is taken to put it into char[]. Both text and char[]
>>> start out the same, but because char[] is changed, termText is set
>>> to null.
>>> Second run:
>>> A starts with a Token with a char[] because it is reusing the token
>>> from the last run or because it is using a localToken from the
>>> caller. If it is a localToken, then the scenario is as above. But if
>>> it is the end result of the first run, then A is re-using the token
>>> that is cached in B. Since C last modified it, it is char[].
>>> B uses its cached Token, but it was modified by A to be char[] with
>>> null text. Now B takes a performance hit as it creates a new String.
>>> C is as in the first run.
>>> Another scenario:
>>> A, B and C are all legacy. This would only be filter chains that are
>>> not provided by core Lucene, as the core filter chains have been
>>> migrated. This would be a performance hit.
>> Why is this a performance hit?  If they are all legacy they would all
>> implement the non-reuse next() API, and would use setTermText, and no
>> conversion to char[] would happen (except inside DocumentsWriter)?
> I was thinking faster than I was typing. I meant to say "not be a
> performance hit".

Ahh, ok.
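
(For illustration, here is a minimal sketch of the dual representation
discussed above -- not the actual Token source, just the shape of it.
termText() caches a String built from the buffer, so both fields may be
set as long as they hold the same characters, while initTermBuffer()
copies the String's chars into the buffer and drops the String:)

    // sketch only; field names follow the Token API discussed above
    public String termText() {
      if (termText == null && termBuffer != null) {
        // mixed-chain cost: build a String from the buffer, then cache
        // it so repeated calls return the same physical String
        termText = new String(termBuffer, 0, termLength);
      }
      return termText;
    }

    private void initTermBuffer() {
      if (termBuffer == null && termText != null) {
        // mixed-chain cost in the other direction: copy the chars out
        termBuffer = new char[termText.length()];
        termText.getChars(0, termText.length(), termBuffer, 0);
        termLength = termText.length();
        termText = null;  // the invariant noted earlier in the thread
      }
    }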

>>> A last scenario:
>>> A, B and C are all char[]. This would not take a performance hit.
>>> It seems to me that in a mixed chain there will always be a
>>> performance hit.
>> Right, though since we cache the String we should save on multiple
>> calls to termText().  And in a non-mixed chain (all new or all old) I
>> think there wouldn't be a hit.
>>> But my guess is that maintaining termBuffer once used would
>>> be a good thing.
>> Interesting...
>>> So, a modified suggestion to maintain the performance but improve
>>> the first scenario. Do you see any problem with the following?
>>> If termBuffer is used in a token, then it is maintained and never
>>> set to null.
>>> Note also that resizeTermBuffer(size) maintains the termBuffer. That
>>> is, it copies the text when the array is grown. When it is known
>>> that it is going to be slammed, this is unnecessary. One can
>>> implement a helper function that merely grows the array. There are a
>>> couple of places this would be useful.
>>> Thus setTermText would be something like:
>>> public void setTermText(String text) {
>>>   termText = text;
>>>   if (termBuffer != null) {
>>>     growTermBuffer(termText.length());
>>>     termText.getChars(0, termText.length(), termBuffer, 0);
>>>     termLength = termText.length();  // keep the buffer's length in sync
>>>   }
>>> }
>>> A possible implementation to grow the array would be:
>>> private void growTermBuffer(int newSize) {
>>>   // determine the best size
>>>   if (newSize < MIN_BUFFER_SIZE) {
>>>     newSize = MIN_BUFFER_SIZE;
>>>   }
>>>   // if the buffer exists and is too small, then determine a better
>>>   // size. this is the current doubling algorithm; it could be better.
>>>   int tbLength = termBuffer == null ? 0 : termBuffer.length;
>>>   if (tbLength > 0 && newSize > tbLength) {
>>>     int size = tbLength;
>>>     while (size < newSize) {
>>>       size *= 2;
>>>     }
>>>     newSize = size;
>>>   }
>>>   // check to see if the buffer needs to be resized; unlike
>>>   // resizeTermBuffer, the old contents are not copied over
>>>   if (newSize > tbLength) {
>>>     termBuffer = new char[newSize];
>>>   }
>>> }
>> That looks good.  Though, if Token is 100% re-used (full chain is
>> "new") then I would expect growing the char[] to be very low cost
>> (happens very few times).
> Without peeking at your Python sample below ;) the doubling algorithm
> is costly as it over-allocates too much. But it has the nice behavior
> that reallocations are fewer.
> But, now peeking ahead, a less aggressive allocation policy would
> cause more allocations, and the array copy becomes more important,
> though not significantly. But in a mixed environment with some
> sharing, it could be magnified.
> I didn't dig into it, but in the core, are reusable Tokens per
> document or per DocumentWriter? If it is the former, then the copy
> becomes more significant. If it is the latter, then it is almost moot,
> as you point out.

It's actually per DocumentsWriter per thread per field.
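
(As a sketch of what that reuse looks like from the consumer side --
the variable names here are illustrative, but the next(Token) pattern
is the one discussed in this thread:)

    // one reusable Token, standing in for the one DocumentsWriter
    // holds per thread per field
    final Token reusableToken = new Token();
    for (Token t = stream.next(reusableToken); t != null;
         t = stream.next(reusableToken)) {
      // t may be reusableToken itself, or a different Token (e.g. one
      // cached by a filter); the consumer must not assume which
    }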

>>> More below....
>>>> More responses below:
>>>> DM Smith <> wrote:
>>>>> I think the Token implementation as it stands can use some
>>>>> improvement and I'd be willing to do it. I'd like some input,
>>>>> though. Especially because it is core to Lucene.
>>>>> I've been working on eliminating deprecations from my user code
>>>>> and I ran across Token.getText() as being deprecated.
>>>>> This is not about my code, but the code in Token.
>>>>> In Token, it allows one of two representations of the actual
>>>>> token: either an immutable String, or a mutable char[]. One can
>>>>> flip back and forth between these all too easily.
>>>>> termText() is deprecated, so termBuffer() is suggested as a
>>>>> replacement. Calling termBuffer() will potentially copy the text
>>>>> out of the String termText and into the newly created buffer and
>>>>> return it.
>>>>> Calling setTermText(str), which is not deprecated, will drop the
>>>>> buffer and save the str in termText.
>>>>> It appears that the invariant that is trying to be established is
>>>>> that either termText or termBuffer holds the content, but not
>>>>> both. However, termBuffer() potentially violates this by loading
>>>>> termText with the termBuffer, but not nulling out the buffer.
>>>> Actually, both are allowed to be set, as long as they are the same.
>>>> So termBuffer() is allowed to leave both non-null.
>>> I'm not sure of the advantage.
>> Covered above (mixed chains).
>>>>> Also, in my code, I am not manipulating char[], so after getting
>>>>> the buffer I need to immediately convert it to a string to process
>>>>> it. And then when I'm done, I have a String possibly of some other
>>>>> length. To stuff it back into the termBuffer requires a call to:
>>>>> setTermBuffer(s.toCharArray(), 0, s.length())
>>>> It would be better to call Token.resizeTermBuffer(...), then
>>>> s.getChars into the Token's term buffer (saves a buffer copy).
>>> Many thanks. I saw the comment, but missed getChars. This is better,
>>> but even better would be growTermBuffer (above), since
>>> resizeTermBuffer potentially has an unnecessary copy.
>> True, but remember growing should be very rare in the 100% reuse
>> (new) case.
> I'm presuming then that the DocumentWriter is the holder of the one
> reusable Token.
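
(For illustration, the String round trip being discussed might look
like the sketch below; "process" is a hypothetical String-based
transformation, and the rest uses the existing termBuffer API:)

    // convert the buffer to a String, transform it, and stuff it back
    String s = process(new String(t.termBuffer(), 0, t.termLength()));
    t.resizeTermBuffer(s.length());  // the proposed growTermBuffer
                                     // would skip the copy done here
    s.getChars(0, s.length(), t.termBuffer(), 0);
    t.setTermLength(s.length());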
>>>>> I was looking at this in light of TokenFilter's next(Token) method
>>>>> and how it was being used. In looking at the contrib filters, they
>>>>> have not been modified. Further, most of them, if they work with
>>>>> the content analysis and generation, do their work in strings.
>>>>> Some of these appear to be good candidates for using char[] rather
>>>>> than strings, such as the NGram filter. But others look like
>>>>> they'd just as well remain with String manipulation.
>>>> It would be great to upgrade all contrib filters to use the re-use
>>>> APIs.
>>> I'll be happy to work toward that end. I think it affects my
>>> performance today.
>> That would be great :)
>>>>> I'd like to suggest that, internally, Token be changed to only use
>>>>> the char[] termBuffer and eliminate termText.
>>>> The question is what performance cost we are incurring, e.g. on the
>>>> contrib (& other) sources/filters.  Every time setTermText is
>>>> called, we copy out the chars (instead of holding a reference to
>>>> the String). Every time getText() is called we create a new
>>>> String(...) from the char[].  I think it's potentially a high cost,
>>>> and so maybe we should wait until 3.0 when we drop the deprecated
>>>> APIs?
>>> See above. I'll concede that. But I think that once termBuffer is
>>> used because of a mixed chain, it should be retained.
>> This may cause problems if a core Tokenizer is used to produce the
>> tokens, but then a big chain of old (non-reuse) filters is used after
>> that?
> If the reusable token is per DocumentWriter and the chain is per
> Document, then it is an advantage to retain it. The buffer is there
> for the next Document, likely big enough that it does not need to be
> allocated again.

Well, remember that a single non-reuse filter in the chain "breaks"
the sharing, going backwards.  I.e., the new token source will see a
different Token with every next() call, so the buffer doesn't get
reused.  So I think in this case we would spend a lot of extra time
converting the Strings in the non-reuse filters back to char[].
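
(A sketch of the kind of filter that breaks the sharing -- assuming the
2.3-era default implementations, where a filter that only overrides the
old API ends up handing a fresh Token to its input on every call:)

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    class LegacyFilter extends TokenFilter {
      LegacyFilter(TokenStream in) { super(in); }

      // non-reuse API: no Token is offered upstream, so the default
      // next() of the input allocates a new Token (and a new buffer)
      // for every call
      public Token next() throws IOException {
        Token t = input.next();
        if (t == null) return null;
        t.setTermText(t.termText().toLowerCase());  // String-based work
        return t;
      }
    }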

>>>>> And also, that termText be restored as not deprecated.
>>>> It made me nervous keeping this method because it looks like it
>>>> should be cheap to call, and in the past it was very cheap to call.
>>>> But maybe we could keep it, if we mark very, very clearly in the
>>>> javadocs the performance cost you incur by using this method (it
>>>> makes a new String() every time)?
>>>>> But, in TokenFilter, next() should be deprecated, IMHO.
>>>> I think this is a good idea.  After all, if people don't want to
>>>> bother using the passed-in Token, they are still allowed to return
>>>> a new one.
>>> Should the constructors in Token that take a String be deprecated,
>>> too? The comments seem to suggest that.
>> Actually I think they should.  We should strongly encourage the
>> re-use APIs.
>>>>> I have also used a better algorithm than doubling for resizing an
>>>>> array. I'd have to hunt for it.
>>>> That would be great!
>>> I'm looking, but still haven't found it. :)
>>> I'm going to dig into this one.
>> Python has an interesting approach, for its list type.  Here's a
>> snippet from listobject.c from Python's sources:
>> 	/* This over-allocates proportional to the list size, making room
>> 	 * for additional growth.  The over-allocation is mild, but is
>> 	 * enough to give linear-time amortized behavior over a long
>> 	 * sequence of appends() in the presence of a poorly-performing
>> 	 * system realloc().
>> 	 * The growth pattern is:  0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
>> 	 */
>> 	new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6) + newsize;
> This is a good one. It is computationally simple and it is not too  
> aggressive.
>> I'm sure Perl has something interesting too :)  But I think this is
>> probably overkill for us ...
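
(For comparison, that policy translated to Java -- a sketch only, with
the growth pattern from the comment above:)

    // mild over-allocation: size/8 extra, plus a small constant
    int newAllocated = (newSize >> 3) + (newSize < 9 ? 3 : 6) + newSize;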
>>> Based on the outcome of this discussion, would this be one Jira
>>> issue? Reopening LUCENE-969? A separate issue for the contrib
>>> changes?
>> I would say open a new issue?
> OK. I'm on vacation for the next week and I don't know what kind of  
> connectivity I'll have. So it might wait a bit.
>>> Since this would not be an API change, would there need to be
>>> changes to test cases? If so, what would you suggest?
>> I think the existing test cases are likely sufficient though I always
>> love adding new ones ;)
> Me too ;)
> -- DM
