lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1422) New TokenStream API
Date Thu, 13 Nov 2008 17:20:44 GMT


Michael McCandless commented on LUCENE-1422:

Looks good!

I'm seeing this failure (in test I just committed this AM).  I think
it's OK, because the new API is enabled for all tests and I'm using
the old API with that analyzer?

    [junit] Testcase: testExclusiveLowerNull(	Caused
    [junit] This token does not have the attribute 'class org.apache.lucene.analysis.tokenattributes.TermAttribute'.
    [junit] java.lang.IllegalArgumentException: This token does not have the attribute 'class
    [junit] 	at org.apache.lucene.util.AttributeSource.getAttribute(
    [junit] 	at org.apache.lucene.index.TermsHashPerField.start(
    [junit] 	at org.apache.lucene.index.DocInverterPerField.processFields(
    [junit] 	at org.apache.lucene.index.DocFieldConsumersPerField.processFields(
    [junit] 	at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(
    [junit] 	at org.apache.lucene.index.DocumentsWriter.updateDocument(
    [junit] 	at org.apache.lucene.index.DocumentsWriter.addDocument(
    [junit] 	at org.apache.lucene.index.IndexWriter.addDocument(
    [junit] 	at org.apache.lucene.index.IndexWriter.addDocument(
    [junit] 	at
    [junit] 	at
    [junit] 	at

Some other random questions:

  * I'm a little confused by 'start' and 'initialize' in TokenStream.
    DocInverterPerField (consumer of this API) calls
    analyzer.reusableTokenStream to get a token stream.  Then it calls
    stream.reset().  Then it calls stream.start() which then calls
    stream.initialize().  Can we consolidate these (there are 4 places
    that we call to "start" the stream now)?
    EG why can't analyzer.reusableTokenStream() do the init internally
    in the new API?  (Also, in StopFilter, initialize() sets termAtt &
    posIncrAtt, but I would think this only needs to happen once when
    that TokenFilter is created?  BackCompatTokenStream add the attrs
    in its ctor, which seems better.).

  * BackCompatTokenStream is calling attributes.put directly but all
    others use super's addAttribute.

  * Why is BackCompatTokenStream overriding so many methods?  EG
    has/get/addAttribute -- won't super do the same thing?

  * Maybe add reasons to some of the asserts, eg StopFilter has
    "assert termAtt != null", so maybe append to that ": initialize()
    wasn't called".

> New TokenStream API
> -------------------
>                 Key: LUCENE-1422
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: lucene-1422-take4.patch, lucene-1422-take5.patch, lucene-1422.patch,
lucene-1422.take2.patch, lucene-1422.take3.patch, lucene-1422.take3.patch
> This is a very early version of the new TokenStream API that 
> we started to discuss here:
> This implementation is a bit different from what I initially
> proposed in the thread above. I introduced a new class called
> AttributedToken, which contains the same termBuffer logic 
> from Token. In addition it has a lazily-initialized map of
> Class<? extends Attribute> -> Attribute. Attribute is also a
> new class in a new package, plus several implementations like
> PositionIncrementAttribute, PayloadAttribute, etc.
> Similar to my initial proposal is the prototypeToken() method
> which the consumer (e. g. DocumentsWriter) needs to call.
> The token is created by the tokenizer at the end of the chain
> and pushed through all filters to the end consumer. The 
> tokenizer and also all filters can add Attributes to the 
> token and can keep references to the actual types of the
> attributes that they need to read of modify. This way, when
> boolean nextToken() is called, no casting is necessary.
> I added a class called TestNewTokenStreamAPI which is not 
> really a test case yet, but has a static demo() method, which
> demonstrates how to use the new API.
> The reason to not merge Token and TokenStream into one class 
> is that we might have caching (or tee/sink) filters in the 
> chain that might want to store cloned copies of the tokens
> in a cache. I added a new class NewCachingTokenStream that
> shows how such a class could work. I also implemented a deep
> clone method in AttributedToken and a 
> copyFrom(AttributedToken) method, which is needed for the 
> caching. Both methods have to iterate over the list of 
> attributes. The Attribute subclasses itself also have a
> copyFrom(Attribute) method, which unfortunately has to down-
> cast to the actual type. I first thought that might be very
> inefficient, but it's not so bad. Well, if you add all
> Attributes to the AttributedToken that our old Token class
> had (like offsets, payload, posIncr), then the performance
> of the caching is somewhat slower (~40%). However, if you 
> add less attributes, because not all might be needed, then
> the performance is even slightly faster than with the old API.
> Also the new API is flexible enough so that someone could
> implement a custom caching filter that knows all attributes
> the token can have, then the caching should be just as 
> fast as with the old API.
> This patch is not nearly ready, there are lot's of things 
> missing:
> - unit tests
> - change DocumentsWriter to use new API 
>   (in backwards-compatible fashion)
> - patch is currently java 1.5; need to change before 
>   commiting to 2.9
> - all TokenStreams and -Filters should be changed to use 
>   new API
> - javadocs incorrect or missing
> - hashcode and equals methods missing in Attributes and 
>   AttributedToken
> I wanted to submit it already for brave people to give me 
> early feedback before I spend more time working on this.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message