lucene-java-user mailing list archives

From Benson Margulies <ben...@basistech.com>
Subject Re: LookaheadTokenFilter
Date Sat, 07 Sep 2013 20:33:49 GMT
LUCENE-5202. It seems to show the problem of the extra peek. I'm still
struggling to make sense of the 'problem' of not always calling
afterPosition(); that may be entirely my own confusion.

On Sat, Sep 7, 2013 at 4:21 PM, Michael McCandless
<lucene@mikemccandless.com> wrote:
> That would be awesome, thanks!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Sat, Sep 7, 2013 at 3:40 PM, Benson Margulies <benson@basistech.com> wrote:
>> I think I had better build you a test case for this situation, and
>> attach it to a JIRA.
>>
>> On Sat, Sep 7, 2013 at 3:33 PM, Michael McCandless
>> <lucene@mikemccandless.com> wrote:
>>> Something is wrong; I'm not sure what offhand, but calling peekToken
>>> 10 times should not stack all tokens @ position 0; it should stack the
>>> tokens at the positions where they occurred.  Are you sure the posIncr
>>> att is sometimes 1 (i.e., the position is in fact moving forward for
>>> some tokens)?
>>>
>>> nextToken() only calls peekToken() once the lookahead buffer is exhausted.
>>>
>>> afterPosition() should be called within nextToken(), for each
>>> position, once all tokens leaving that position are done.
>>>
>>> Your use case *should* be working: inside your incrementToken() you
>>> call peekToken() over and over until you've seen the full sentence
>>> (saving away any state in your subclass of Position), then nextToken()
>>> to emit the buffered tokens, and to insert your own tokens when
>>> afterPosition() is called ...
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
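
To illustrate the point above about tokens being stacked "at the positions where they occurred", here is a minimal plain-Java sketch of how position increments translate into absolute positions. It is not Lucene code: `PositionSketch` and `absolutePositions` are made-up names, and it assumes the usual convention that the position starts at -1, so a first token with increment 1 lands on position 0.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration only; none of these names are Lucene API. */
final class PositionSketch {

    /** Converts per-token position increments into absolute positions. */
    static List<Integer> absolutePositions(int[] posIncrs) {
        List<Integer> positions = new ArrayList<>();
        int pos = -1; // assumption: nothing consumed yet, so the first increment of 1 gives position 0
        for (int inc : posIncrs) {
            pos += inc;
            positions.add(pos);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Five ordinary tokens: each one advances the position by one.
        System.out.println(absolutePositions(new int[] {1, 1, 1, 1, 1})); // [0, 1, 2, 3, 4]
        // A token with increment 0 shares the position of the token before it.
        System.out.println(absolutePositions(new int[] {1, 0, 1}));       // [0, 0, 1]
    }
}
```

If every peeked token really did end up at position 0, the increments seen by the filter would all have to be 0 after the first token, which is what the question about the posIncr attribute is probing.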
>>>
>>>
>>> On Sat, Sep 7, 2013 at 1:10 PM, Benson Margulies <benson@basistech.com> wrote:
>>>> nextToken() calls peekToken(). That seems to prevent my lookahead
>>>> processing from seeing that item later. Am I missing something?
>>>>
>>>>
>>>> On Fri, Sep 6, 2013 at 9:15 PM, Benson Margulies <benson@basistech.com> wrote:
>>>>> I think that the penny just dropped, and I should not be using this class.
>>>>>
>>>>> If I call peekToken 10 times while sitting at token 0, this class will
>>>>> stack up all 10 of these _at token position 0_. That's not really very
>>>>> helpful for what I'm doing. I need to borrow code from this class and
>>>>> not use it.
>>>>>
>>>>> On Fri, Sep 6, 2013 at 9:10 PM, Benson Margulies <benson@basistech.com> wrote:
>>>>>> Michael,
>>>>>>
>>>>>> I'm apparently not fully deconfused yet.
>>>>>>
>>>>>> I've got a very simple incrementToken function. It calls peekToken to
>>>>>> stack up the tokens.
>>>>>>
>>>>>> afterPosition is never called; I expected it to be called as each of
>>>>>> the peeked tokens gets next-ed back out.
>>>>>>
>>>>>> I assume that I'm missing something simple.
>>>>>>
>>>>>>
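>>>>>>     public boolean incrementToken() throws IOException {
>>>>>>         // Nothing buffered yet: look ahead through the whole sentence first.
>>>>>>         if (positions.getMaxPos() < 0) {
>>>>>>             peekSentence();
>>>>>>         }
>>>>>>         // Replay the buffered tokens one at a time.
>>>>>>         return nextToken();
>>>>>>     }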
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies <benson@basistech.com> wrote:
>>>>>>> On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless
>>>>>>> <lucene@mikemccandless.com> wrote:
>>>>>>>>
>>>>>>>> On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies <benson@basistech.com> wrote:
>>>>>>>> > I'm trying to work through the logic of reading ahead until I've seen
>>>>>>>> > a marker for the end of a sentence, then applying some analysis to all of
>>>>>>>> > the tokens of the sentence, and then changing some attributes of each
>>>>>>>> > token to reflect the results.
>>>>>>>> >
>>>>>>>> > The queue of tokens for a position is just a State, so there isn't an API
>>>>>>>> > there to set any values.
>>>>>>>> >
>>>>>>>> > So do I need to subclass Position for myself, store the additional
>>>>>>>> > information in there, and set the attributes as each token comes by on the
>>>>>>>> > output side?
>>>>>>>>
>>>>>>>> Yes, that sounds right.  Either that or, on emitting the eventual
>>>>>>>> Tokens, apply your logic there (because at that point, after
>>>>>>>> restoreState, you have access to all the attr values for that token).
>>>>>>>>
>>>>>>>> > I would be grateful for a bit more explanation of afterPosition versus
>>>>>>>> > incrementToken; some of the mock classes call peek from afterPosition, and
>>>>>>>> > I expected to see peek called in incrementToken based on the javadoc.
>>>>>>>>
>>>>>>>> afterPosition is where your subclass can "insert" new tokens.
>>>>>>>>
>>>>>>>> I think (it's been a while here...) you are allowed to call peekToken
>>>>>>>> in afterPosition; this is necessary if your logic about inserting
>>>>>>>> additional tokens leaving a given position depends on future tokens.
>>>>>>>>
>>>>>>>> But: are you doing any new token insertion?  Or are you just tweaking
>>>>>>>> the attributes of the tokens that pass through the filter?  If it's
>>>>>>>> the latter then this class may be overkill ... you could make a simple
>>>>>>>> TokenFilter.incrementToken that just enumerates & saves all input
>>>>>>>> tokens, does its processing, then returns those tokens one by one,
>>>>>>>> instead.
>>>>>>>
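
The "enumerate and save, process, then return one by one" alternative suggested above can be sketched without any Lucene dependency. In this rough stand-in, an `Iterator<String>` plays the role of the upstream token stream, a literal "." is assumed to be the end-of-sentence marker, and the "analysis" is a placeholder that tags each token with its sentence length. All names here (`SentenceBufferSketch`, `fillSentence`, `drain`) are hypothetical, and a real TokenFilter would capture and restore attribute state rather than pass strings around.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Plain-Java stand-in for a buffering filter; none of these names are Lucene API. */
final class SentenceBufferSketch {
    private final Iterator<String> input;                     // stands in for the upstream stream
    private final Deque<String> pending = new ArrayDeque<>(); // processed tokens waiting to be emitted

    SentenceBufferSketch(Iterator<String> input) {
        this.input = input;
    }

    /** Returns the next processed token, or null once the input is exhausted. */
    String next() {
        if (pending.isEmpty()) {
            fillSentence();
        }
        return pending.poll(); // poll() yields null when nothing is left
    }

    /** Reads up to and including the sentence marker, then processes the whole batch. */
    private void fillSentence() {
        List<String> sentence = new ArrayList<>();
        while (input.hasNext()) {
            String token = input.next();
            sentence.add(token);
            if (token.equals(".")) { // assumption: "." marks the end of a sentence
                break;
            }
        }
        // Placeholder analysis: annotate every token with the sentence length.
        for (String token : sentence) {
            pending.add(token + "/" + sentence.size());
        }
    }

    /** Convenience helper: runs the sketch over a list and collects the output. */
    static List<String> drain(List<String> tokens) {
        SentenceBufferSketch filter = new SentenceBufferSketch(tokens.iterator());
        List<String> out = new ArrayList<>();
        for (String t = filter.next(); t != null; t = filter.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [a/3, b/3, ./3, c/2, ./2]
        System.out.println(drain(Arrays.asList("a", "b", ".", "c", ".")));
    }
}
```

This shape only rewrites tokens that already exist; it does not cover inserting new tokens, which is the case the lookahead class is aimed at.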
>>>>>>> I'm not adding tokens yet, but I will be soon, so all of this isn't
>>>>>>> entirely crazy. The underlying capability here includes decompounding.
>>>>>>> (I have mixed feelings about just adding all the fragments to the
>>>>>>> token stream, as it can reduce precision, but there isn't an obvious
>>>>>>> alternative (except perhaps to suppress the super-common ones)).
>>>>>>>
>>>>>>> So, to summarize, logic might be:
>>>>>>>
>>>>>>> in incrementToken:
>>>>>>>
>>>>>>> If positions.getMaxPos() > -1, just return nextToken(). If not, loop
>>>>>>> calling peekToken to acquire a sentence, process the sentence, and
>>>>>>> attach the lemmas and compound-pieces to the Position subclass
>>>>>>> objects.
>>>>>>>
>>>>>>> in afterPosition, as each token comes 'into focus', splat the lemma
>>>>>>> from the Position into the char term attribute, and insert new tokens
>>>>>>> as needed for the compound components.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> benson
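
The output side of the plan summarized above (lemma first, then compound pieces inserted at the same position) might look roughly like the following plain-Java sketch. It only models the resulting positions, not the Lucene callbacks: `Analyzed` is a hypothetical stand-in for the per-position data a Position subclass would carry, and inserted pieces are assumed to take a position increment of 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Plain-Java illustration only; none of these names are Lucene API. */
final class CompoundInsertSketch {

    /** Hypothetical per-position payload: the lemma plus any compound pieces. */
    static final class Analyzed {
        final String lemma;
        final List<String> parts;

        Analyzed(String lemma, List<String> parts) {
            this.lemma = lemma;
            this.parts = parts;
        }
    }

    /** Emits "term@position" strings: the lemma first, then its pieces at the same position. */
    static List<String> emit(List<Analyzed> sentence) {
        List<String> out = new ArrayList<>();
        int pos = -1;
        for (Analyzed a : sentence) {
            pos += 1;                     // the original token advances the position
            out.add(a.lemma + "@" + pos);
            for (String part : a.parts) { // assumption: inserted tokens use increment 0
                out.add(part + "@" + pos);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Analyzed> sentence = Arrays.asList(
                new Analyzed("run", Collections.<String>emptyList()),
                new Analyzed("football", Arrays.asList("foot", "ball")));
        // prints [run@0, football@1, foot@1, ball@1]
        System.out.println(emit(sentence));
    }
}
```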
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Mike McCandless
>>>>>>>>
>>>>>>>> http://blog.mikemccandless.com
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>>