lucene-java-user mailing list archives

From Benson Margulies <ben...@basistech.com>
Subject Re: LookaheadTokenFilter
Date Sat, 07 Sep 2013 17:10:43 GMT
nextToken() calls peekToken(). That seems to prevent my lookahead
processing from seeing that item later. Am I missing something?


On Fri, Sep 6, 2013 at 9:15 PM, Benson Margulies <benson@basistech.com> wrote:
> I think that the penny just dropped, and I should not be using this class.
>
> If I call peekToken 10 times while sitting at token 0, this class will
> stack up all 10 of these _at token position 0_. That's not really very
> helpful for what I'm doing. I need to borrow code from this class and
> not use it.
>
> On Fri, Sep 6, 2013 at 9:10 PM, Benson Margulies <benson@basistech.com> wrote:
>> Michael,
>>
>> I'm apparently not fully deconfused yet.
>>
>> I've got a very simple incrementToken function. It calls peekToken to
>> stack up the tokens.
>>
>> afterPosition is never called; I expected it to be called as each of
>> the peeked tokens gets next-ed back out.
>>
>> I assume that I'm missing something simple.
>>
>>
>>     public boolean incrementToken() throws IOException {
>>         if (positions.getMaxPos() < 0) {
>>             peekSentence();
>>         }
>>         return nextToken();
>>     }
>>
>>
>>
>> On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies <benson@basistech.com> wrote:
>>> On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless
>>> <lucene@mikemccandless.com> wrote:
>>>>
>>>> On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies <benson@basistech.com> wrote:
>>>> > I'm trying to work through the logic of reading ahead until I've seen
>>>> > marker for the end of a sentence, then applying some analysis to all
>>>> > of the tokens of the sentence, and then changing some attributes of
>>>> > each token to reflect the results.
>>>> >
>>>> > The queue of tokens for a position is just a State, so there isn't an
>>>> > API there to set any values.
>>>> >
>>>> > So do I need to subclass Position for myself, store the additional
>>>> > information in there, and set the attributes as each token comes by
>>>> > on the output side?
>>>>
>>>> Yes, that sounds right.  Either that or, on emitting the eventual
>>>> Tokens, apply your logic there (because at that point, after
>>>> restoreState, you have access to all the attr values for that token).
>>>>
>>>> > I would be grateful for a bit more explanation of afterPosition versus
>>>> > incrementToken; some of the mock classes call peek from afterPosition,
>>>> > and I expected to see peek called in incrementToken based on the javadoc.
>>>>
>>>> afterPosition is where your subclass can "insert" new tokens.
>>>>
>>>> I think (it's been a while here...) you are allowed to call peekToken
>>>> in afterPosition; this is necessary if your logic about inserting
>>>> additional tokens leaving a given position depends on future tokens.
>>>>
>>>> But: are you doing any new token insertion?  Or are you just tweaking
>>>> the attributes of the tokens that pass through the filter?  If it's
>>>> the latter then this class may be overkill ... you could make a simple
>>>> TokenFilter.incrementToken that just enumerates & saves all input
>>>> tokens, does its processing, then returns those tokens one by one,
>>>> instead.
>>>
>>> I'm not adding tokens yet, but I will be soon, so all of this isn't
>>> entirely crazy. The underlying capability here includes decompounding.
>>> (I have mixed feelings about just adding all the fragments to the
>>> token stream, as it can reduce precision, but there isn't an obvious
>>> alternative (except perhaps to suppress the super-common ones)).
>>>
>>> So, to summarize, logic might be:
>>>
>>> in incrementToken:
>>>
>>> If positions.getMaxPos() > -1, just return nextToken(). If not, loop
>>> calling peekToken to acquire a sentence, process the sentence, and
>>> attach the lemmas and compound-pieces to the Position subclass
>>> objects.
>>>
>>> in afterPosition, as each token comes 'into focus', splat the lemma
>>> from the Position into the char term attribute, and insert new tokens
>>> as needed for the compound components.
>>>
>>> Thanks,
>>> benson
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
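
For reference, the simpler alternative suggested in the quoted thread (save all input tokens of a sentence, process them, then return them one by one) can be modelled without Lucene at all. The following is only a sketch under stated assumptions: the class name SentenceBuffer, the "." end-of-sentence marker, and the "token/sentenceLength" tagging are all hypothetical stand-ins, and plain Strings stand in for the attribute states that a real TokenFilter would save with captureState() and replay with restoreState().

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/**
 * Lucene-free model of "buffer a sentence, process it, replay it".
 * Hypothetical illustration only; not the LookaheadTokenFilter API.
 */
final class SentenceBuffer {
    private final Iterator<String> input;                      // stands in for the upstream TokenStream
    private final Deque<String> pending = new ArrayDeque<>();  // processed tokens waiting to be emitted

    SentenceBuffer(Iterator<String> input) {
        this.input = input;
    }

    /** Returns the next processed token, or null at end of input (models incrementToken() returning false). */
    String next() {
        if (pending.isEmpty()) {
            fillSentence();
        }
        return pending.poll(); // poll() returns null when nothing is left
    }

    /** Reads ahead up to and including the end-of-sentence marker, then processes the whole sentence. */
    private void fillSentence() {
        List<String> sentence = new ArrayList<>();
        while (input.hasNext()) {
            String token = input.next();
            sentence.add(token);
            if (token.equals(".")) { // assumed end-of-sentence marker
                break;
            }
        }
        // Placeholder for the real analysis: it needs the whole sentence,
        // so here each token is simply tagged with the sentence length.
        for (String token : sentence) {
            pending.add(token + "/" + sentence.size());
        }
    }
}
```

Calling next() repeatedly on the input a, b, ., c, . yields each token tagged with the length of its sentence, and the consumer never sees a token before its whole sentence has been analysed. Inserting extra tokens (for example compound pieces) would amount to adding more entries to the pending queue in the processing loop.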


