lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From DM Smith <>
Subject Re: Token implementation
Date Tue, 20 May 2008 09:21:15 GMT

On May 20, 2008, at 5:01 AM, Michael McCandless wrote:

> DM Smith wrote:
>> On May 19, 2008, at 4:33 PM, Michael McCandless wrote:
>>> DM Smith <> wrote:
>>>> Michael McCandless wrote:
>>>>> I agree the situation is not ideal, and it's confusing.
>>>> My problem as a user is that I have to read the code to figure  
>>>> out how to
>>>> optimally use the class. The JavaDoc is a bit wanting.
>>> Yeah we should fix the javadocs.
>>>>> This comes back to LUCENE-969.
>>>>> At the time, we decided to keep both String & char[] only to avoid
>>>>> performance cost for those analyzer chains that use String tokens
>>>>> exclusively.
>>>>> The idea was to allow Token to keep both text or char[] and  
>>>>> sometimes
>>>>> both (if they are storing the same characters, as happens if
>>>>> termBuffer() is called when it's a String being stored)
>>>> When termBuffer() is called termText is always null on return.  
>>>> This is the
>>>> invariant of initTermBuffer() which is called directly or  
>>>> indirectly by
>>>> every method that touches termBuffer.
>>> Sorry I meant termText().
>>>> It is only after calling termText() that one could have both. The  
>>>> only
>>>> advantage I see here is that calling it twice without any  
>>>> intervening call
>>>> to a termBuffer method would it return the same physical string.
>>> Right.
>>>> After calling setTermText(newText), termBuffer is set to null.
>>>> I presume the purpose of a filter is to get the token content,  
>>>> and if
>>>> modified, set the new content. If so, the result of the setXXX  
>>>> will be that
>>>> either termText or termBuffer will be null.
>>> Right, though, if there's no change, and the next filter in the  
>>> chain
>>> calls termText(), we save constructing a new String by caching it.
>>>>> Then, in 3.0, we would make the change you are proposing (to only
>>>>> store char[] internally).  That was the plan, anyway.   
>>>>> Accelerating
>>>>> this plan (to store only char[] today) is compelling, but I worry
>>>>> about the performance hit to legacy analyzer chains...
>>>> I wonder whether it is all that significant an issue. Today, some  
>>>> of the
>>>> Lucene analyzers have been migrated to using char[], while  
>>>> others, notably
>>>> contrib, continue to use text.
>>>> IMHO: Prior to char[], the text was replaceable, but not  
>>>> modifiable. From a
>>>> practical perspective, Token reuse minimized the cost of  
>>>> construction, but
>>>> not much else. The performance of a Token was predictable, but  
>>>> the filter
>>>> was potentially sub-optimal. With char[] and supporting methods,  
>>>> the text
>>>> became modifiable, too.
>>>> When a filter calls setTermText or setTermBuffer, it does not  
>>>> know how the
>>>> consumer of the token will work. It could be that it stores it with
>>>> setTermText and the next filter calls termBuffer().
>>>> I may not understand this correctly, but it seems to me that the  
>>>> following
>>>> is plausible given a filter chain of Lucene provided filters  
>>>> (including contrib)
>>>> If we have a token filter chain of A -> B -> C, which uses next() 

>>>> in any
>>>> part of the chain, the flow of a reusable Token is stopped. A  
>>>> given filter
>>>> may cache a Token and reuse it. So consider the following scenario:
>>>> A overrides next(Token) and reuses the token via char[]
>>>> B overrides next() and has a cached Token and updates text.
>>>> C overrides next(Token) and reuses the token via char[].
>>>> First run:
>>>> After A is done, the termText in the supplied Token is null and  
>>>> termBuffer
>>>> has a value.
>>>> B provides it's own token so it is not influenced by A.
>>>> C is given the token that B returns, because the caller is  
>>>> effectively using
>>>> "token =", but because the token has text, a  
>>>> performance
>>>> hit is taken to put it into char[]. Both text and char[] start  
>>>> out the same,
>>>> but because char[] is changed, termText is set to null.
>>>> Second run:
>>>> A starts with a Token with a char[] because it is reusing the  
>>>> token from the
>>>> last run or because it is using a localToken from the caller. If  
>>>> it is a
>>>> localToken, then the scenario is as above. But if it is the end  
>>>> result of
>>>> the first run, then A is re-using the token that is cached in B.  
>>>> Since C
>>>> last modified it, it is char[].
>>>> B uses its cached Token, but it was modified by A to be char[]  
>>>> with null
>>>> text. Now B takes a performance hit as it creates a new String.
>>>> C is as in the first run.
>>>> Another scenario:
>>>> A, B and C are all legacy. This would only be filter chains that  
>>>> are not
>>>> provided by core Lucene as the core filter chains have been  
>>>> migrated. This
>>>> would be a performance hit.
>>> Why is this a performance hit?  If they are all legacy they would  
>>> all
>>> implemented the non-reuse next() API, and would use setTermText, and
>>> no conversion to char[] would happen (except inside  
>>> DocumentsWriter)?
>> I was thinking faster than I was typing. I meant to say "not be a  
>> performance hit".
> Ahh, ok.
>>>> A last scenario:
>>>> A, B and C are all char[]. This would not take a performance hit.
>>>> It seems to me that in a mixed chain, that there will always be a
>>>> performance hit.
>>> Right, though since we cache the String we should save on multiple
>>> calls to termText().  And in a non-mixed chain (all new or all  
>>> old) I
>>> think there wouldn't be a hit.
>>>> But my guess is that maintaining termBuffer once used would
>>>> be a good thing.
>>> Interesting...
>>>> So, a modified suggestion to maintain the performance but improve  
>>>> the first
>>>> scenario. Do you see any problem with the following?
>>>> If termBuffer is used in a token, then it is maintained and never  
>>>> set to  null.
>>>> Note also that resizeTermBuffer(size) maintains the termBuffer.  
>>>> That is, it
>>>> copies the text when the array is grown. When it is known that it  
>>>> is going
>>>> to be slammed this is unnecessary. One can implement a helper  
>>>> function that
>>>> merely grows the array. There are a couple of places this
>>>> Thus setTokenText would be something like:
>>>> public void setTermText(String text) {
>>>> termText = text;
>>>> if (termBuffer != null) {
>>>>   growTermBuffer(termText.length());
>>>>   termText.getChars(0, termText.length(), termBuffer, 0);
>>>> }
>>>> }
>>>> A possible implementation to grow the array would be:
>>>> private void growTermBuffer(int newSize)
>>>> {
>>>> // determine the best size
>>>> if (newSize < MIN_BUFFER_SIZE) {
>>>> newSize = MIN_BUFFER_SIZE;
>>>> }
>>>> // if the buffer exists and is too small, then determine a better  
>>>> size.
>>>> // this is the current doubling algorithm. it could be better.
>>>> int tbLength = termBuffer == null ? 0 : termBuffer.length;
>>>> if (tbLength > 0 && newSize > tbLength) {
>>>> int size = tbLength;
>>>> while (size < newSize) {
>>>>    size *= 2;
>>>> }
>>>>  newSize = size;
>>>> }
>>>> // Check to see if the buffer needs to be resized
>>>> if (newSize > tbLength)
>>>> {
>>>> termBuffer = new char[newSize];
>>>> }
>>>> }
>>> That looks good.  Though, if Token is 100% re-used (full chain is
>>> "new") then I would expect growing the char[] to be very low cost
>>> (happens very few times).
>> Without peeking at your Python sample below ;) the doubling  
>> algorithm is costly as it over allocates too much. But it has the  
>> nice behavior that reallocations are fewer.
>> But, now peeking ahead, a less aggressive allocation policy would  
>> cause more allocations and the array copy becomes more important.  
>> Though not significantly. But in a mixed environment with some  
>> sharing, it could be magnified.
>> I didn't dig into it, but in the core, are reusable Tokens per  
>> document or per DocumentWriter. If it is the former then the copy  
>> becomes more significant. If it is the latter, then it is almost  
>> moot, as you point out.
> It's actually per DocumentsWriter per thread per field.

This is excellent.

>>>> More below....
>>>>> More responses below:
>>>>> DM Smith <> wrote:
>>>>>> I think the Token implementation as it stands can use some  
>>>>>> improvement and
>>>>>> I'd be willing to do it. I'd like some input, though.  
>>>>>> Especially because it
>>>>>> is core to Lucene.
>>>>>> I've been working on eliminating deprecations from my user code 

>>>>>> and I ran
>>>>>> across Token.getText() as being deprecated.
>>>>>> This is not about my code, but the code in Token.
>>>>>> In Token, it allows one of two representations of the actual  
>>>>>> token, either
>>>>>> an immutable String, or a mutable char[]. One can flip back and 

>>>>>> forth
>>>>>> between these all to easily.
>>>>>> termText() is deprecated so termBuffer() is suggested as a  
>>>>>> replacement.
>>>>>> Calling termBuffer() will potentially copy the text out of the  
>>>>>> String
>>>>>> termText and into the newly created buffer and return it.
>>>>>> Calling setTermText(str), which is not deprecated, will drop  
>>>>>> the buffer and
>>>>>> save the str in termText.
>>>>>> It appears that the invariant that is trying to be established is
>>>>>> either termText or termBuffer holds the content, but not both.
>>>>>> However, termBuffer() potentially violates this by loading  
>>>>>> termText
>>>>>> with the termBuffer, but not nulling out the buffer.
>>>>> Actually, both are allowed to be set, as long as they are the  
>>>>> same.
>>>>> So termBuffer() is allowed to leave both non-null.
>>>> I'm not sure of the advantage.
>>> Covered above (mixed chains).
>>>>>> Also, in my code, I am not manipulating char[] so getting the  
>>>>>> buffer, I need
>>>>>> to immediately convert it to a string to process it. And then  
>>>>>> when I'm done,
>>>>>> I have a String possibly of some other length. To stuff it back 

>>>>>> into the
>>>>>> termBuffer, requires a call to:
>>>>>> setTermBuffer(s.toCharArray(), o, s.length())
>>>>> It would be better to call Token.resizeTermBuffer(...), then
>>>>> s.getChars into the Token's term buffer (saves a buffer copy).
>>>> Many thanks. I saw the comment, but missed getChars. This is  
>>>> better, but
>>>> even better would be growTermBuffer (above) since resizeTermBuffer
>>>> potentially has an unnecessary copy.
>>> True, but remember growing should be very rare in 100% reuse (new)  
>>> case.
>> I'm presuming then that the DocumentWriter is the holder of the one  
>> reusable Token.
>>>>>> I was looking at this in light of TokenFilter's next(Token)  
>>>>>> method and how
>>>>>> it was being used. In looking at the contrib filters, they have 

>>>>>> not been
>>>>>> modified. Further, most of them, if they work with the content  
>>>>>> analysis and
>>>>>> generation, do their work in strings. Some of these appear to  
>>>>>> be good
>>>>>> candidates for using char[] rather than strings, such as the  
>>>>>> NGram filter.
>>>>>> But others look like they'd just as well remain with String  
>>>>>> manipulation.
>>>>> It would be great to upgrade all contrib filters to use the re- 
>>>>> use APIs.
>>>> I'll be happy to work toward that end. I think it affects my  
>>>> performance
>>>> today.
>>> That would be great :)
>>>>>> I'd like to suggest that internally, that Token be changed to  
>>>>>> only use
>>>>>> char[] termBuffer and eliminate termText.
>>>>> The question is what performance cost we are incurring eg on the
>>>>> contrib (& other) sources/filters?  Every time setTermText is  
>>>>> called,
>>>>> we copy out the chars (instead of holding a reference to the  
>>>>> String).
>>>>> Every time getText() is called we create a new String(...) from  
>>>>> the
>>>>> char[].  I think it's potentially a high cost, and so maybe we  
>>>>> should
>>>>> wait until 3.0 when we drop the deprecated APIs?
>>>> See above. I'll concede that. But I think that once termBuffer is  
>>>> used
>>>> because of a mixed chain, it should be retained.
>>> This may cause problems if a core Tokenizer is used to produce the
>>> tokens, but then a big chain of old (non-reuse) filters is used  
>>> after
>>> that?
>> If the reusable token is per DocumentWriter and the chain is per  
>> Document, then it is an advantage to retain it. he buffer is there  
>> for the next Document, likely big enough that it does not need to  
>> be allocated again.
> Well, remember that a single non-reuse filter in the chain "breaks"  
> the sharing, going backwards.  IE the new token source will see a  
> different Token with every next() call.  So the buffer doesn't get  
> reused.  So I think in this case we would spend alot of extra time  
> converting the Strings in the non-reuse filters back to char[].

IIRC, I saw somewhere, in an outer driving loop something like the  

result =

If my memory is correct, then reuse is re-established each time  
through the loop. That localToken is presumably using char[] to set  
initial content. Or at least once all core and contrib is converted to  
work with it.

>>>>>> And also, that termText be restored as not deprecated.
>>>>> It made me nervous keeping this method because it looks like it  
>>>>> should
>>>>> be cheap to call, and in the past it was very cheap to call.  But,
>>>>> maybe we could keep it, if we mark very very clearly in the  
>>>>> javadocs
>>>>> the performance cost you incur by using this method (it makes a  
>>>>> new
>>>>> String() every time)?
>>>>>> But, in TokenFilter, next() should be deprecated, IMHO.
>>>>> I think this is a good idea.  After all if people don't want to  
>>>>> bother
>>>>> using the passed in Token, they are still allowed to return a new
>>>>> one.
>>>> Should the constructors in Token that take a String be  
>>>> deprecated, too? The
>>>> comments seem to suggest that.
>>> Actually I think they should.  We should strongly encourage the re- 
>>> use
>>> APIs.
>>>>>> I have also used a better algorithm than doubling for resizing an
>>>>>> array. I'd have to hunt for it.
>>>>> That would be great!
>>>> I'm looking, but still haven't found it. :)
>>>> I'm looking into digging into this one.
>>> Python has an interesting approach, for its list type.  Here's a
>>> snippet from listobject.c from Python's sources:
>>> 	/* This over-allocates proportional to the list size, making room
>>> 	 * for additional growth.  The over-allocation is mild, but is
>>> 	 * enough to give linear-time amortized behavior over a long
>>> 	 * sequence of appends() in the presence of a poorly-performing
>>> 	 * system realloc().
>>> 	 * The growth pattern is:  0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
>>> 	 */
>>> 	new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6) + newsize;
>> This is a good one. It is computationally simple and it is not too  
>> aggressive.
>>> I'm sure Perl has something interesting too :)  But I think this is
>>> probably overkill for us ...
>>>> Based on the outcome of this discussion, would this be one Jira  
>>>> issue?
>>>> Reopening LUCENE-969? A separate issue for the contrib changes?
>>> I would say open a new issue?
>> OK. I'm on vacation for the next week and I don't know what kind of  
>> connectivity I'll have. So it might wait a bit.
>>>> Since this would not be an API change, would there need to be  
>>>> changes to
>>>> test cases? If so, what would you suggest?
>>> I think the existing test cases are likely sufficient though I  
>>> always
>>> love adding new ones ;)
>> Me too ;)
>> -- DM
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message