lucene-dev mailing list archives

From "Michael Busch (JIRA)" <>
Subject [jira] Updated: (LUCENE-1422) New TokenStream API
Date Wed, 29 Oct 2008 01:16:46 GMT


Michael Busch updated LUCENE-1422:

    Attachment: lucene-1422-take4.patch

Because multiple people mentioned it would be better to merge
TokenStream and Token into one class, I thought more about it
and now I prefer that approach too. I implemented it and added
a class TokenStreamState to capture the state of a stream,
which can be used for buffering Tokens. The performance of the
CachingTokenFilter is lower than before if all attributes that
the Token had are used, but slightly better if fewer are used
(which was previously not possible). I also had some ideas for
making the buffering of Tokens perform better, but this patch
is already pretty long, so I decided to address better
buffering in a separate issue at a later point. Here is a
summary of my changes:

Changes in the analysis package:
- Added the tokenattributes subpackage, known from the previous
  patches.
- Added the abstract class AttributeSource that owns the
  attribute map and appropriate methods to get and add the
  attributes. TokenStream now extends AttributeSource.
- Deprecated the Token class.
- Added TokenStream#start(), TokenStream#initialize() and 
  TokenStream#incrementToken(). start() must be called by a 
  consumer before incrementToken() is called for the first
  time. start() calls initialize(), which can be used by
  TokenStreams and TokenFilters to add or get attributes. I
  separated start() and initialize() to enforce in TokenFilters
  that input.start() is called; otherwise this would be a
  likely source of bugs (it happened to me before I added
  initialize()).
- Added another subclass of AttributeSource called 
  TokenStreamState which can be used to capture a current state
  of a TokenStream, e. g. for buffering purposes. I changed the
  CachingTokenFilter and Sink/Tee-TokenFilter to make use of 
  this new class.
- Changed all core TokenStreams and TokenFilters to implement
  the new methods and deprecated the next(Token) methods, but
  left them for backwards compatibility.
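
The attribute-based design described above can be sketched as
follows. This is a minimal, self-contained illustration of the
pattern (shared attribute instances plus incrementToken()); the
class and method names follow the description in this patch, but
the code is a simplified stand-in, not Lucene's actual
implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for the AttributeSource described above:
// owns the attribute map and hands out shared attribute instances.
class AttributeSource {
    private final Map<Class<?>, Object> attributes = new LinkedHashMap<>();

    // Returns the existing instance for this type, creating one on
    // first use, so every filter in the chain shares the same object.
    @SuppressWarnings("unchecked")
    <T> T addAttribute(Class<T> clazz) {
        return (T) attributes.computeIfAbsent(clazz, c -> {
            try {
                return c.getDeclaredConstructor().newInstance();
            } catch (Exception e) {
                throw new IllegalArgumentException(e);
            }
        });
    }
}

// Hypothetical attribute holding the token text.
class TermAttribute {
    private String term = "";
    public void setTerm(String t) { term = t; }
    public String term() { return term; }
}

abstract class TokenStream extends AttributeSource {
    // Must be called once before the first incrementToken() call.
    public void start() {}
    // Advances to the next token; attributes are updated in place.
    public abstract boolean incrementToken();
}

// Toy tokenizer: splits on whitespace and writes each word into
// the shared TermAttribute instead of returning a Token object.
class WhitespaceTokenizer extends TokenStream {
    private final String[] words;
    private int pos = 0;
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);

    WhitespaceTokenizer(String text) { words = text.split("\\s+"); }

    @Override
    public boolean incrementToken() {
        if (pos >= words.length) return false;
        termAtt.setTerm(words[pos++]);
        return true;
    }
}
```

A consumer grabs the attribute references once, then loops on
incrementToken() with no per-token object and no casting.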

Changes in the indexer package:

- Changed DocInverterPerField.processFields to use the new API
  if TokenStream.useNewAPI() is set to true. I also added an 
  inner class to DocInverterPerThread called 
  BackwardsCompatibilityStream that allows me to set a Token,
  whose Attributes then simply return the values from that
  token. That allows me to change all consumers in the indexer
  package to not use Token anymore at all, but only
  TokenStream, without a performance hit.
- Added start(Fieldable) method to InvertedDocConsumerPerField
  and TermsHashConsumerPerField that is called to notify the 
  consumers that one field *instance* is now going to be 
  processed, so that they can get the attribute references
  from DocInverter.FieldInvertState.attributeSource. Also 
  changed the signature of the add() method of the above 
  mentioned classes to not take a Token anymore.
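
The revised consumer contract might look roughly like the
sketch below. The interface and method names follow the
description above, but the types are simplified, hypothetical
stand-ins (the real InvertedDocConsumerPerField is an internal
indexer class and is not reproduced here):

```java
// Simplified stand-in for Lucene's Fieldable.
interface Fieldable { String name(); }

// Hypothetical attribute the consumer reads for each token.
class TermAttribute {
    String term = "";
}

interface InvertedDocConsumerPerField {
    // Called once per field *instance*, before any tokens are added,
    // so the consumer can cache references to the attributes it needs.
    void start(Fieldable field);
    // Called once per token; no Token is passed anymore -- the consumer
    // reads the attribute references it grabbed in start().
    void add();
}

// Toy consumer that counts non-empty terms for one field instance.
class TermCountingConsumer implements InvertedDocConsumerPerField {
    private final TermAttribute termAtt;
    int count = 0;

    TermCountingConsumer(TermAttribute termAtt) { this.termAtt = termAtt; }

    public void start(Fieldable field) { count = 0; }

    public void add() {
        if (!termAtt.term.isEmpty()) count++;
    }
}
```

The point of the start(Fieldable) notification is that the
attribute lookups happen once per field instance rather than
once per token.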
Changes in the queryparser package:
- Changed QueryParser so that it uses a CachingTokenFilter 
  instead of a List to buffer tokens. 
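
The buffering idea behind TokenStreamState and
CachingTokenFilter can be sketched like this: on the first
pass, capture a copy of the current attribute values per token;
on replay, restore them into the live, shared attribute. The
types below are simplified stand-ins, not Lucene's classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical attribute with the copy/copyFrom pair needed
// for capturing and restoring a stream state.
class TermAttribute {
    String term = "";
    TermAttribute copy() {
        TermAttribute c = new TermAttribute();
        c.term = term;
        return c;
    }
    void copyFrom(TermAttribute other) { this.term = other.term; }
}

// Minimal caching filter: capture() snapshots the attribute state,
// incrementToken() replays the snapshots in order.
class CachingFilter {
    private final TermAttribute termAtt;            // shared with the chain
    private final List<TermAttribute> cache = new ArrayList<>();
    private int replayPos = -1;

    CachingFilter(TermAttribute termAtt) { this.termAtt = termAtt; }

    // First pass: store a copy of the current state.
    void capture() { cache.add(termAtt.copy()); }

    // Rewind to the start of the buffered tokens.
    void reset() { replayPos = 0; }

    // Replay: restore the next captured state into the live attribute.
    boolean incrementToken() {
        if (replayPos < 0 || replayPos >= cache.size()) return false;
        termAtt.copyFrom(cache.get(replayPos++));
        return true;
    }
}
```

Since each snapshot copies every attribute, caching cost grows
with the number of attributes in use, which matches the
performance behavior noted above.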


Changes in the tests:

- Added the class TokenStreamTestUtils to the analysis test 
  package, which contains two inner helper classes:
  BackwardsCompatibleStream and BackwardsCompatibleFilter.
  Both override TokenStream#next(Token), call 
  incrementToken() and then copy all attribute values into the
  Token returned to the caller of next(Token). That makes it
  easy to run the existing tests with both the old and new API.
All test cases pass with useNewAPI=true and false. I think 
this patch is mostly done now (I just have to update 
analysis/package.html and clean up imports), unless we're
unhappy with the APIs. 
Please give me some feedback about this approach.

> New TokenStream API
> -------------------
>                 Key: LUCENE-1422
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: lucene-1422-take4.patch, lucene-1422.patch, lucene-1422.take2.patch,
> lucene-1422.take3.patch
> This is a very early version of the new TokenStream API that 
> we started to discuss here:
> This implementation is a bit different from what I initially
> proposed in the thread above. I introduced a new class called
> AttributedToken, which contains the same termBuffer logic 
> from Token. In addition it has a lazily-initialized map of
> Class<? extends Attribute> -> Attribute. Attribute is also a
> new class in a new package, plus several implementations like
> PositionIncrementAttribute, PayloadAttribute, etc.
> Similar to my initial proposal is the prototypeToken() method
> which the consumer (e. g. DocumentsWriter) needs to call.
> The token is created by the tokenizer at the end of the chain
> and pushed through all filters to the end consumer. The 
> tokenizer and also all filters can add Attributes to the 
> token and can keep references to the actual types of the
> attributes that they need to read or modify. This way, when
> boolean nextToken() is called, no casting is necessary.
> I added a class called TestNewTokenStreamAPI which is not 
> really a test case yet, but has a static demo() method, which
> demonstrates how to use the new API.
> The reason to not merge Token and TokenStream into one class 
> is that we might have caching (or tee/sink) filters in the 
> chain that might want to store cloned copies of the tokens
> in a cache. I added a new class NewCachingTokenStream that
> shows how such a class could work. I also implemented a deep
> clone method in AttributedToken and a 
> copyFrom(AttributedToken) method, which is needed for the 
> caching. Both methods have to iterate over the list of 
> attributes. The Attribute subclasses themselves also have a
> copyFrom(Attribute) method, which unfortunately has to
> downcast to the actual type. I first thought that might be
> very inefficient, but it's not so bad. If you add all
> Attributes to the AttributedToken that our old Token class
> had (like offsets, payload, posIncr), then the performance
> of the caching is somewhat slower (~40%). However, if you 
> add fewer attributes, because not all might be needed, then
> the performance is even slightly faster than with the old API.
> The new API is also flexible enough that someone could
> implement a custom caching filter that knows all attributes
> the token can have; then the caching should be just as 
> fast as with the old API.
> This patch is not nearly ready; there are lots of things 
> missing:
> - unit tests
> - change DocumentsWriter to use new API 
>   (in backwards-compatible fashion)
> - patch is currently java 1.5; need to change before 
>   committing to 2.9
> - all TokenStreams and -Filters should be changed to use 
>   new API
> - javadocs incorrect or missing
> - hashCode and equals methods missing in Attributes and 
>   AttributedToken
> I wanted to submit it already so that brave people can give
> me early feedback before I spend more time working on this.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

