lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5202) LookaheadTokenFilter consumes an extra token in nextToken
Date Sun, 08 Sep 2013 13:24:51 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13761264#comment-13761264 ]

Michael McCandless commented on LUCENE-5202:
--------------------------------------------

bq. There's a call to peekToken in nextToken used to detect the end of the input. When that
gets called, a token 'moves' from the input to the positions, so the calls to peekToken in
my code never see it.

OK I think I see.

So, your peekSentence has peeked N tokens, up until it saw a '.' token.  Then, your incrementToken
calls nextToken() to work through those buffered tokens, tweaking atts before returning each
one.  But on the first nextToken() after the lookahead buffer is exhausted, nextToken() itself
calls peekToken() directly, and you have no chance to intercept that.

But note that this token doesn't actually move to positions (get buffered); it just "passes
through", i.e. when nextToken returns, the atts of that new token are "live" in the attributes
and you could examine them there.
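
To make the flow above concrete, here is a toy model in plain Java. It is only a sketch: PassThroughDemo, seenByPeek, and the use of strings as tokens are made up for illustration, and none of it is the real LookaheadTokenFilter code. It only mimics the behavior described: once the buffer is empty, nextToken() takes a token that no caller-side peekToken() ever saw.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Toy model only; it mimics the flow described above, not Lucene's real class. */
class PassThroughDemo {
    private final Iterator<String> input;
    private final Deque<String> buffer = new ArrayDeque<>();
    /** Tokens the caller obtained through its own peekToken() calls. */
    final List<String> seenByPeek = new ArrayList<>();

    PassThroughDemo(List<String> tokens) { input = tokens.iterator(); }

    /** Buffers one token from the input and records that the caller saw it. */
    String peekToken() {
        if (!input.hasNext()) return null;
        String t = input.next();
        buffer.addLast(t);
        seenByPeek.add(t);
        return t;
    }

    /** Drains the buffer first; on an empty buffer the token comes straight from the input. */
    String nextToken() {
        if (!buffer.isEmpty()) return buffer.removeFirst();
        return input.hasNext() ? input.next() : null; // the "pass-through" case
    }

    public static void main(String[] args) {
        PassThroughDemo d = new PassThroughDemo(Arrays.asList("a", ".", "b", "."));
        d.peekToken(); d.peekToken();   // peek the first sentence: a .
        d.nextToken(); d.nextToken();   // drain the buffered tokens
        String live = d.nextToken();    // buffer is empty, so "b" passes through
        d.peekToken();                  // the caller's next peek starts at "."
        System.out.println(live + " " + d.seenByPeek); // prints: b [a, ., .]
    }
}
```

In this model "b" never shows up in seenByPeek, which matches the symptom in the report: the only place to look at it is right after the nextToken() call that returned it.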

Or, maybe, you could use a counter, incremented as you peek tokens in peekSentence, and then
decremented as you nextToken() off the lookahead, and once that reaches 0 you peekSentence()
again?  Or, maybe LookaheadTF should do this for you, e.g. provide a lookaheadCount saying
how many tokens are in the lookahead buffer.
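
A rough sketch of the counter idea, again in plain Java with a toy stand-in for the lookahead: ToyLookahead, pending, sentenceId, and the "token/sentence number" output are all hypothetical names chosen for the example, not anything LookaheadTokenFilter provides. The point is just that the subclass peeks the next sentence itself whenever its own count drops to 0, so nextToken() never hits an empty buffer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Toy sketch of the counter idea; ToyLookahead is a stand-in, not Lucene's class. */
class CountingSentenceFilter {

    /** Minimal lookahead: peekToken() buffers, nextToken() drains or passes through. */
    static class ToyLookahead {
        private final Iterator<String> input;
        private final Deque<String> buffer = new ArrayDeque<>();

        ToyLookahead(List<String> tokens) { input = tokens.iterator(); }

        String peekToken() {
            if (!input.hasNext()) return null;
            String t = input.next();
            buffer.addLast(t);
            return t;
        }

        String nextToken() {
            if (!buffer.isEmpty()) return buffer.removeFirst();
            return input.hasNext() ? input.next() : null;
        }
    }

    private final ToyLookahead lookahead;
    private int pending = 0;    // tokens peeked but not yet returned
    private int sentenceId = 0; // number of the sentence currently being drained

    CountingSentenceFilter(List<String> tokens) { lookahead = new ToyLookahead(tokens); }

    /** Peeks up to and including the next "." token, counting each buffered token. */
    private void peekSentence() {
        sentenceId++;
        String t;
        while ((t = lookahead.peekToken()) != null) {
            pending++;
            if (t.equals(".")) break;
        }
    }

    /** Returns "token/sentenceNumber", or null at end of input. */
    String incrementToken() {
        if (pending == 0) peekSentence(); // refill before nextToken() can pass through
        if (pending == 0) return null;    // nothing left to peek
        pending--;
        return lookahead.nextToken() + "/" + sentenceId;
    }

    static List<String> run(List<String> tokens) {
        CountingSentenceFilter f = new CountingSentenceFilter(tokens);
        List<String> out = new ArrayList<>();
        for (String t = f.incrementToken(); t != null; t = f.incrementToken()) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        // prints: [a/1, b/1, ./1, c/2, ./2]
        System.out.println(run(Arrays.asList("a", "b", ".", "c", ".")));
    }
}
```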

Net/net, it may be a lot easier to just make your own dedicated class :)  It would have direct
control over the buffer, so you wouldn't have to deal with the confusing flow of LookaheadTF.
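
As a rough idea of what a dedicated class might look like, here is a plain-Java sketch that owns its sentence buffer outright. Everything in it is hypothetical (SentenceBufferingFilter, the "token:sentenceLength" output, strings in place of attributes); a real version would extend TokenFilter and capture/restore attribute state, which is left out here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Hypothetical standalone filter: it owns its buffer instead of extending a lookahead base class. */
class SentenceBufferingFilter {
    private final Iterator<String> input;
    private final Deque<String> sentence = new ArrayDeque<>();
    private int sentenceLength = 0; // size of the sentence currently being emitted

    SentenceBufferingFilter(Iterator<String> input) { this.input = input; }

    /** Refills the buffer with every token up to and including the next ".". */
    private void fillSentence() {
        sentenceLength = 0;
        while (input.hasNext()) {
            String t = input.next();
            sentence.addLast(t);
            sentenceLength++;
            if (t.equals(".")) break;
        }
    }

    /** Returns "token:sentenceLength", or null once the input is exhausted. */
    String next() {
        if (sentence.isEmpty()) fillSentence();
        if (sentence.isEmpty()) return null;
        return sentence.removeFirst() + ":" + sentenceLength;
    }

    static List<String> drain(List<String> tokens) {
        SentenceBufferingFilter f = new SentenceBufferingFilter(tokens.iterator());
        List<String> out = new ArrayList<>();
        for (String t = f.next(); t != null; t = f.next()) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        // prints: [a:3, b:3, .:3, c:2, .:2]
        System.out.println(drain(Arrays.asList("a", "b", ".", "c", ".")));
    }
}
```

Because the only path from the input to the caller goes through fillSentence(), no token can slip past the sentence logic.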

                
> LookaheadTokenFilter consumes an extra token in nextToken
> ---------------------------------------------------------
>
>                 Key: LUCENE-5202
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5202
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.3.1
>            Reporter: Benson Margulies
>         Attachments: LUCENE-5202.patch, LUCENE-5202.patch
>
>
> This is a bit hard to explain except by looking at the test case. I've coded a filter
> that uses LookaheadTokenFilter. The incrementToken method peeks some tokens. Then, it seems,
> nextToken in the Lookahead class calls peekToken itself, which seems to me to consume a token
> so that it's not seen when the derived class sets out to process the next set of tokens.
> In passing, this test case can be used to demonstrate that it does not work to try to
> use the afterPosition method to set up attributes of the token that we're 'after'. Probably
> that was never intended. However, I'm hoping for some feedback as to whether the rest of the
> structure here is as intended for subclasses of LookaheadTokenFilter.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

