lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1693) AttributeSource/TokenStream API improvements
Date Fri, 10 Jul 2009 21:03:15 GMT


Michael McCandless commented on LUCENE-1693:

I'm still trying to wrap my brain around this issue :)

So to sum up where things [seem] to stand here:

  * We are switching to separate interface + impl for token
    attributes.  This is nice because a single class (eg
    can provide multiple attrs, it makes cloning more efficient, etc.

  * We are keeping good old around; it simply implements all
    the attrs, and is very convenient to use.

I think this is a great improvement to the new tokenStream API.  It's
also a sizable perf gain to clone only one impl that has only the
attrs you care about.

Uwe then extended the original patch:

  * Use reflection on init'ing a TokenStream to determine which of the
    3 APIs (old, newer, newest) it implements

  * Indexer (and in general any consumer) now only has to consume the
    new API.  Any parts of the chain that aren't new are automatically

  * Core tokenizers/filters (and contrib, when we get to them) only
    implement the new API; old API is "wrapped on demand" if needed

  * One can mix & match old, new tokenizers, though at some perf cost.
    But that's actually OK since the original patch it's "all or
    none", ie your chain must be entirely new or old

  * I think (?) a chain of all-old-tokenizer/filters and all-new
    wouldn't see a perf hit?  Wrapping only happens when there's a
    mismatch b/w two filters in the chain?

I'm tentatively optimistic about the extended patch... (though these
back-compat, corner cases, etc., are real hard to think about).  I
agree it's doing alot of "magic" under the hood to figure out how to
best wrap things, but the appeal of only implementing the new api in
core/contrib tokenizers, only consuming new, being free to mix&match,
etc, is strong.

One concern is: TokenStream.initialize looks spookily heavy weight; eg
I don't know the "typical" cost of reflection.  I think there are
likely many apps out there that make a new TokenStream for ever field
that's analyzed (ie implement Analyzer.tokenStream not
reusableTokenStream) and this (introspection every time) could be a
sizable perf hit.  Uwe was this included in your perf tests?

> AttributeSource/TokenStream API improvements
> --------------------------------------------
>                 Key: LUCENE-1693
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch,
LUCENE-1693.patch, LUCENE-1693.patch, LUCENE-1693.patch, lucene-1693.patch,,,,
> This patch makes the following improvements to AttributeSource and
> TokenStream/Filter:
> - removes the set/getUseNewAPI() methods (including the standard
>   ones). Instead by default incrementToken() throws a subclass of
>   UnsupportedOperationException. The indexer tries to call
>   incrementToken() initially once to see if the exception is thrown;
>   if so, it falls back to the old API.
> - introduces interfaces for all Attributes. The corresponding
>   implementations have the postfix 'Impl', e.g. TermAttribute and
>   TermAttributeImpl. AttributeSource now has a factory for creating
>   the Attribute instances; the default implementation looks for
>   implementing classes with the postfix 'Impl'. Token now implements
>   all 6 TokenAttribute interfaces.
> - new method added to AttributeSource:
>   addAttributeImpl(AttributeImpl). Using reflection it walks up in the
>   class hierarchy of the passed in object and finds all interfaces
>   that the class or superclasses implement and that extend the
>   Attribute interface. It then adds the interface->instance mappings
>   to the attribute map for each of the found interfaces.
> - AttributeImpl now has a default implementation of toString that uses
>   reflection to print out the values of the attributes in a default
>   formatting. This makes it a bit easier to implement AttributeImpl,
>   because toString() was declared abstract before.
> - Cloning is now done much more efficiently in
>   captureState. The method figures out which unique AttributeImpl
>   instances are contained as values in the attributes map, because
>   those are the ones that need to be cloned. It creates a single
>   linked list that supports deep cloning (in the inner class
>   AttributeSource.State). AttributeSource keeps track of when this
>   state changes, i.e. whenever new attributes are added to the
>   AttributeSource. Only in that case will captureState recompute the
>   state, otherwise it will simply clone the precomputed state and
>   return the clone. restoreState(AttributeSource.State) walks the
>   linked list and uses the copyTo() method of AttributeImpl to copy
>   all values over into the attribute that the source stream
>   (e.g. SinkTokenizer) uses. 
> The cloning performance can be greatly improved if not multiple
> AttributeImpl instances are used in one TokenStream. A user can
> e.g. simply add a Token instance to the stream instead of the individual
> attributes. Or the user could implement a subclass of AttributeImpl that
> implements exactly the Attribute interfaces needed. I think this
> should be considered an expert API (addAttributeImpl), as this manual
> optimization is only needed if cloning performance is crucial. I ran
> some quick performance tests using Tee/Sink tokenizers (which do
> cloning) and the performance was roughly 20% faster with the new
> API. I'll run some more performance tests and post more numbers then.
> Note also that when we add serialization to the Attributes, e.g. for
> supporting storing serialized TokenStreams in the index, then the
> serialization should benefit even significantly more from the new API
> than cloning. 
> Also, the TokenStream API does not change, except for the removal 
> of the set/getUseNewAPI methods. So the patches in LUCENE-1460
> should still work.
> All core tests pass, however, I need to update all the documentation
> and also add some unit tests for the new AttributeSource
> functionality. So this patch is not ready to commit yet, but I wanted
> to post it already for some feedback. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message