lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: LookaheadTokenFilter
Date Sat, 07 Sep 2013 19:33:14 GMT
Something is wrong; I'm not sure what offhand, but calling peekToken
10 times should not stack all tokens @ position 0; it should stack the
tokens at the positions where they occurred.  Are you sure the posIncr
att is sometimes 1 (i.e., the position is in fact moving forward for
some tokens)?
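
To make the position arithmetic concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class and method names are hypothetical. It assumes only Lucene's usual convention that a token's position is the previous position plus its position increment, starting from -1. If every increment after the first is 0, all tokens land on position 0, which would explain the stacking described.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns position increments into absolute positions. */
final class PositionIncrements {
    static List<Integer> absolutePositions(int[] posIncrs) {
        List<Integer> positions = new ArrayList<>();
        int pos = -1; // assumed convention: the first token with increment 1 lands on position 0
        for (int incr : posIncrs) {
            pos += incr; // an increment of 0 stacks the token on the previous position
            positions.add(pos);
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(absolutePositions(new int[] {1, 1, 1, 1})); // [0, 1, 2, 3]
        System.out.println(absolutePositions(new int[] {1, 0, 0, 0})); // [0, 0, 0, 0]
    }
}
```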

nextToken() only calls peekToken() once the lookahead buffer is exhausted.

afterPosition() should be called within nextToken(), for each
position, once all tokens leaving that position are done.

Your use case *should* be working: inside your incrementToken() you
call peekToken() over and over until you've seen the full sentence
(saving away any state in your subclass of Position), then nextToken()
to emit the buffered tokens, and to insert your own tokens when
afterPosition() is called ...
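
The flow described above can be modeled roughly as follows. This is a toy in plain Java, not Lucene's LookaheadTokenFilter: it reuses the names peekToken, afterPosition and incrementToken only for orientation, and it does not reproduce the real signatures or position handling. The Slot class stands in for a Position subclass. The "analysis" (lower-casing, splitting on a hyphen) and the "." sentence marker are assumptions made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

/** Toy model of peek-then-emit; an illustration, NOT Lucene's LookaheadTokenFilter. */
final class SentenceLookahead {
    /** Plays the role of a Position subclass: a buffered token plus saved analysis. */
    private static final class Slot {
        final String term;
        String lemma;
        List<String> parts = new ArrayList<>();
        Slot(String term) { this.term = term; }
    }

    private final Iterator<String> input;
    private final Deque<Slot> buffer = new ArrayDeque<>();     // peeked, not yet emitted
    private final Deque<String> inserted = new ArrayDeque<>(); // queued by afterPosition

    SentenceLookahead(Iterator<String> input) { this.input = input; }

    /** Pulls one input token into the buffer; returns false at end of input. */
    private boolean peekToken() {
        if (!input.hasNext()) return false;
        buffer.addLast(new Slot(input.next()));
        return true;
    }

    /** Peeks up to and including the sentence marker, then analyzes the whole sentence. */
    private void peekSentence() {
        while (peekToken() && !".".equals(buffer.peekLast().term)) {
            // keep peeking until "." is buffered or the input ends
        }
        for (Slot s : buffer) {
            s.lemma = s.term.toLowerCase(Locale.ROOT); // hypothetical "lemmatization"
            if (s.lemma.contains("-")) {
                s.parts = Arrays.asList(s.lemma.split("-")); // hypothetical decompounding
            }
        }
    }

    /** Called as a buffered token leaves; queues any extra tokens saved for it. */
    private void afterPosition(Slot done) { inserted.addAll(done.parts); }

    /** Returns the next output token, or null when the stream is exhausted. */
    String incrementToken() {
        if (!inserted.isEmpty()) return inserted.pollFirst();
        if (buffer.isEmpty()) peekSentence();
        Slot next = buffer.pollFirst();
        if (next == null) return null;
        afterPosition(next);
        return next.lemma;
    }

    /** Convenience for the demo: runs the filter to exhaustion. */
    static List<String> drain(Iterator<String> input) {
        SentenceLookahead filter = new SentenceLookahead(input);
        List<String> out = new ArrayList<>();
        for (String t = filter.incrementToken(); t != null; t = filter.incrementToken()) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        // prints [book-shelf, book, shelf, fell, ., dogs, ran]
        System.out.println(drain(Arrays.asList("Book-Shelf", "Fell", ".", "Dogs", "Ran").iterator()));
    }
}
```

The point of the sketch is the ordering: all per-sentence state is computed while peeking, and the emit side only reads that state back, one token at a time.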

Mike McCandless

http://blog.mikemccandless.com


On Sat, Sep 7, 2013 at 1:10 PM, Benson Margulies <benson@basistech.com> wrote:
> nextToken() calls peekToken(). That seems to prevent my lookahead
> processing from seeing that item later. Am I missing something?
>
>
> On Fri, Sep 6, 2013 at 9:15 PM, Benson Margulies <benson@basistech.com> wrote:
>> I think that the penny just dropped, and I should not be using this class.
>>
>> If I call peekToken 10 times while sitting at token 0, this class will
>> stack up all 10 of these _at token position 0_. That's not really very
>> helpful for what I'm doing. I need to borrow code from this class and
>> not use it.
>>
>> On Fri, Sep 6, 2013 at 9:10 PM, Benson Margulies <benson@basistech.com> wrote:
>>> Michael,
>>>
>>> I'm apparently not fully deconfused yet.
>>>
>>> I've got a very simple incrementToken function. It calls peekToken to
>>> stack up the tokens.
>>>
>>> afterPosition is never called; I expected it to be called as each of
>>> the peeked tokens gets next-ed back out.
>>>
>>> I assume that I'm missing something simple.
>>>
>>>
>>>     public boolean incrementToken() throws IOException {
>>>         if (positions.getMaxPos() < 0) {
>>>             peekSentence();
>>>         }
>>>         return nextToken();
>>>     }
>>>
>>>
>>>
>>> On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies <benson@basistech.com> wrote:
>>>> On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless
>>>> <lucene@mikemccandless.com> wrote:
>>>>>
>>>>> On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies <benson@basistech.com> wrote:
>>>>> > I'm trying to work through the logic of reading ahead until I've seen
>>>>> > a marker for the end of a sentence, then applying some analysis to all
>>>>> > of the tokens of the sentence, and then changing some attributes of
>>>>> > each token to reflect the results.
>>>>> >
>>>>> > The queue of tokens for a position is just a State, so there isn't an
>>>>> > API there to set any values.
>>>>> >
>>>>> > So do I need to subclass Position for myself, store the additional
>>>>> > information in there, and set the attributes as each token comes by on
>>>>> > the output side?
>>>>>
>>>>> Yes, that sounds right.  Either that or, on emitting the eventual
>>>>> Tokens, apply your logic there (because at that point, after
>>>>> restoreState, you have access to all the attr values for that token).
>>>>>
>>>>> > I would be grateful for a bit more explanation of afterPosition versus
>>>>> > incrementToken; some of the mock classes call peek from afterPosition,
>>>>> > and I expected to see peek called in incrementToken based on the javadoc.
>>>>>
>>>>> afterPosition is where your subclass can "insert" new tokens.
>>>>>
>>>>> I think (it's been a while here...) you are allowed to call peekToken
>>>>> in afterPosition; this is necessary if your logic about inserting
>>>>> additional tokens leaving a given position depends on future tokens.
>>>>>
>>>>> But: are you doing any new token insertion?  Or are you just tweaking
>>>>> the attributes of the tokens that pass through the filter?  If it's
>>>>> the latter then this class may be overkill ... you could make a simple
>>>>> TokenFilter.incrementToken that just enumerates & saves all input
>>>>> tokens, does its processing, then returns those tokens one by one,
>>>>> instead.
>>>>
>>>> I'm not adding tokens yet, but I will be soon, so all of this isn't
>>>> entirely crazy. The underlying capability here includes decompounding.
>>>> (I have mixed feelings about just adding all the fragments to the
>>>> token stream, as it can reduce precision, but there isn't an obvious
>>>> alternative (except perhaps to suppress the super-common ones)).
>>>>
>>>> So, to summarize, logic might be:
>>>>
>>>> in incrementToken:
>>>>
>>>> If positions.getMaxPos() > -1, just return nextToken(). If not, loop
>>>> calling peekToken to acquire a sentence, process the sentence, and
>>>> attach the lemmas and compound-pieces to the Position subclass
>>>> objects.
>>>>
>>>> in afterPosition, as each token comes 'into focus', splat the lemma
>>>> from the Position into the char term attribute, and insert new tokens
>>>> as needed for the compound components.
>>>>
>>>> Thanks,
>>>> benson
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>


