lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: LookaheadTokenFilter
Date Sat, 07 Sep 2013 21:02:24 GMT
Thanks Benson, I'll have a look.

Mike McCandless

http://blog.mikemccandless.com


On Sat, Sep 7, 2013 at 4:33 PM, Benson Margulies <benson@basistech.com> wrote:
> LUCENE-5202. It seems to show the problem of the extra peek. I'm still
> struggling to make sense of the 'problem' of not always calling
> afterPosition(); that may be entirely my own confusion.
>
> On Sat, Sep 7, 2013 at 4:21 PM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
>> That would be awesome, thanks!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Sat, Sep 7, 2013 at 3:40 PM, Benson Margulies <benson@basistech.com> wrote:
>>> I think I had better build you a test case for this situation, and
>>> attach it to a JIRA.
>>>
>>> On Sat, Sep 7, 2013 at 3:33 PM, Michael McCandless
>>> <lucene@mikemccandless.com> wrote:
>>>> Something is wrong; I'm not sure what offhand, but calling peekToken
>>>> 10 times should not stack all tokens @ position 0; it should stack the
>>>> tokens at the positions where they occurred.  Are you sure the posIncr
>>>> att is sometimes 1 (i.e., the position is in fact moving forward for
>>>> some tokens)?
>>>>
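To make the position question concrete, here is a minimal plain-Java sketch of how absolute positions follow from position increments. It is not Lucene code: the class name `PositionDemo` and the method `absolutePositions` are invented for illustration, and the starting value of -1 is an assumption about how positions are conventionally counted. The point is only that tokens stack on one position when their increment is 0, and otherwise move forward.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only; the names here are hypothetical, not part of Lucene. */
class PositionDemo {
    /** Turns per-token position increments into absolute positions. */
    static List<Integer> absolutePositions(int[] posIncrs) {
        List<Integer> positions = new ArrayList<>();
        int pos = -1; // assumed starting point, before the first token
        for (int inc : posIncrs) {
            pos += inc; // an increment of 0 stacks the token on the previous position
            positions.add(pos);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Three ordinary tokens, then one token stacked on the last position.
        System.out.println(absolutePositions(new int[] {1, 1, 1, 0}));
        // If every increment after the first is 0, all tokens share one position.
        System.out.println(absolutePositions(new int[] {1, 0, 0, 0}));
    }
}
```

If the increments seen by the filter look like the second call, everything landing at position 0 would be expected; if they look like the first, it would not.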
>>>> nextToken() only calls peekToken() once the lookahead buffer is exhausted.
>>>>
>>>> afterPosition() should be called within nextToken(), for each
>>>> position, once all tokens leaving that position are done.
>>>>
>>>> Your use case *should* be working: inside your incrementToken() you
>>>> call peekToken() over and over until you've seen the full sentence
>>>> (saving away any state in your subclass of Position), then nextToken()
>>>> to emit the buffered tokens, and to insert your own tokens when
>>>> afterPosition() is called ...
>>>>
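The intended flow described above (peek until the sentence is complete, then drain the buffer, with a hook firing per position) can be modelled in a few lines of plain Java. This is a toy model under stated assumptions, not Lucene's LookaheadTokenFilter: `ToyLookahead`, `peekSentence`, the `log` list and the `peekCalls` counter are all hypothetical, every token occupies its own position, and the hook fires immediately after the token is emitted. It only illustrates the claim that nextToken() peeks when, and only when, the buffer is empty.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Toy model of a lookahead buffer; NOT Lucene's LookaheadTokenFilter. */
class ToyLookahead {
    private final Iterator<String> input;
    private final Deque<String> buffer = new ArrayDeque<>();
    final List<String> log = new ArrayList<>(); // records emits and hook calls
    int peekCalls = 0;                          // how often peekToken() ran

    ToyLookahead(List<String> tokens) {
        this.input = tokens.iterator();
    }

    /** Pulls one token from the input into the buffer; false at end of input. */
    boolean peekToken() {
        peekCalls++;
        if (!input.hasNext()) {
            return false;
        }
        buffer.addLast(input.next());
        return true;
    }

    /** Peeks until the sentence-ending marker has been buffered (or input ends). */
    void peekSentence(String endMarker) {
        while (peekToken()) {
            if (buffer.peekLast().equals(endMarker)) {
                return;
            }
        }
    }

    /** Hook: in this simplified model, called once per emitted position. */
    void afterPosition(String term) {
        log.add("after:" + term);
    }

    /** Emits the next buffered token, peeking only if the buffer is empty. */
    String nextToken() {
        if (buffer.isEmpty() && !peekToken()) {
            return null;
        }
        String term = buffer.removeFirst();
        log.add("emit:" + term);
        afterPosition(term);
        return term;
    }

    public static void main(String[] args) {
        ToyLookahead filter =
            new ToyLookahead(Arrays.asList("the", "cat", "sat", ".", "next"));
        filter.peekSentence(".");
        while (filter.nextToken() != null) {
            // drain the stream
        }
        System.out.println(filter.log);
        System.out.println("peekToken calls: " + filter.peekCalls);
    }
}
```

In this model, draining the four buffered tokens of the sentence triggers no additional peek; the next peek happens only when the buffer runs dry. Whether the real class behaves this way is exactly what the thread is trying to establish.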
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>>
>>>> On Sat, Sep 7, 2013 at 1:10 PM, Benson Margulies <benson@basistech.com> wrote:
>>>>> nextToken() calls peekToken(). That seems to prevent my lookahead
>>>>> processing from seeing that item later. Am I missing something?
>>>>>
>>>>>
>>>>> On Fri, Sep 6, 2013 at 9:15 PM, Benson Margulies <benson@basistech.com> wrote:
>>>>>> I think that the penny just dropped, and I should not be using this class.
>>>>>>
>>>>>> If I call peekToken 10 times while sitting at token 0, this class will
>>>>>> stack up all 10 of these _at token position 0_. That's not really very
>>>>>> helpful for what I'm doing. I need to borrow code from this class and
>>>>>> not use it.
>>>>>>
>>>>>> On Fri, Sep 6, 2013 at 9:10 PM, Benson Margulies <benson@basistech.com> wrote:
>>>>>>> Michael,
>>>>>>>
>>>>>>> I'm apparently not fully deconfused yet.
>>>>>>>
>>>>>>> I've got a very simple incrementToken function. It calls peekToken to
>>>>>>> stack up the tokens.
>>>>>>>
>>>>>>> afterPosition is never called; I expected it to be called as each of
>>>>>>> the peeked tokens gets next-ed back out.
>>>>>>>
>>>>>>> I assume that I'm missing something simple.
>>>>>>>
>>>>>>>
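>>>>>>>     public boolean incrementToken() throws IOException {
>>>>>>>         // No position has been buffered yet: read ahead through a sentence.
>>>>>>>         if (positions.getMaxPos() < 0) {
>>>>>>>             peekSentence();
>>>>>>>         }
>>>>>>>         // Hand back the next buffered token.
>>>>>>>         return nextToken();
>>>>>>>     }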
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies <benson@basistech.com> wrote:
>>>>>>>> On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless
>>>>>>>> <lucene@mikemccandless.com> wrote:
>>>>>>>>>
>>>>>>>>> On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies <benson@basistech.com> wrote:
>>>>>>>>> > I'm trying to work through the logic of reading ahead until I've seen a
>>>>>>>>> > marker for the end of a sentence, then applying some analysis to all of the
>>>>>>>>> > tokens of the sentence, and then changing some attributes of each token to
>>>>>>>>> > reflect the results.
>>>>>>>>> >
>>>>>>>>> > The queue of tokens for a position is just a State, so there isn't an API
>>>>>>>>> > there to set any values.
>>>>>>>>> >
>>>>>>>>> > So do I need to subclass Position for myself, store the additional
>>>>>>>>> > information in there, and set the attributes as each token comes by on the
>>>>>>>>> > output side?
>>>>>>>>>
>>>>>>>>> Yes, that sounds right.  Either that or, on emitting the eventual
>>>>>>>>> Tokens, apply your logic there (because at that point, after
>>>>>>>>> restoreState, you have access to all the attr values for that token).
>>>>>>>>>
>>>>>>>>> > I would be grateful for a bit more explanation of afterPosition versus
>>>>>>>>> > incrementToken; some of the mock classes call peek from afterPosition, and
>>>>>>>>> > I expected to see peek called in incrementToken based on the javadoc.
>>>>>>>>>
>>>>>>>>> afterPosition is where your subclass can "insert" new tokens.
>>>>>>>>>
>>>>>>>>> I think (it's been a while here...) you are allowed to call peekToken
>>>>>>>>> in afterPosition; this is necessary if your logic about inserting
>>>>>>>>> additional tokens leaving a given position depends on future tokens.
>>>>>>>>>
>>>>>>>>> But: are you doing any new token insertion?  Or are you just tweaking
>>>>>>>>> the attributes of the tokens that pass through the filter?  If it's
>>>>>>>>> the latter then this class may be overkill ... you could make a simple
>>>>>>>>> TokenFilter.incrementToken that just enumerates & saves all input
>>>>>>>>> tokens, does its processing, then returns those tokens one by one,
>>>>>>>>> instead.
>>>>>>>>
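The simpler alternative suggested above (save every input token, process the lot, then hand the tokens back one at a time) might look roughly like the following. This is a plain-Java sketch rather than a real TokenFilter: `BufferingFilter` is a hypothetical class, tokens are bare strings instead of attribute states, and the `analysis` function stands in for whatever sentence-level processing is wanted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.function.UnaryOperator;

/** Toy stand-in for a buffering TokenFilter; names are illustrative only. */
class BufferingFilter {
    private final Iterator<String> input;
    private final UnaryOperator<List<String>> analysis; // hypothetical processing step
    private final Deque<String> pending = new ArrayDeque<>();
    private boolean filled = false;

    BufferingFilter(Iterator<String> input, UnaryOperator<List<String>> analysis) {
        this.input = input;
        this.analysis = analysis;
    }

    /** Returns the next processed token, or null once the stream is exhausted. */
    String incrementToken() {
        if (!filled) {
            // First call: enumerate and save all input tokens.
            List<String> all = new ArrayList<>();
            while (input.hasNext()) {
                all.add(input.next());
            }
            // Process them together, then queue the results for output.
            pending.addAll(analysis.apply(all));
            filled = true;
        }
        // Later calls just return the queued tokens one by one.
        return pending.pollFirst();
    }
}
```

A real filter would presumably capture and restore attribute state for each token rather than pass strings around, and would buffer per sentence rather than per stream; those details are left out here.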
>>>>>>>> I'm not adding tokens yet, but I will be soon, so all of this isn't
>>>>>>>> entirely crazy. The underlying capability here includes decompounding.
>>>>>>>> (I have mixed feelings about just adding all the fragments to the
>>>>>>>> token stream, as it can reduce precision, but there isn't an obvious
>>>>>>>> alternative (except perhaps to suppress the super-common ones)).
>>>>>>>>
>>>>>>>> So, to summarize, logic might be:
>>>>>>>>
>>>>>>>> in incrementToken:
>>>>>>>>
>>>>>>>> If positions.getMaxPos() > -1, just return nextToken(). If not, loop
>>>>>>>> calling peekToken to acquire a sentence, process the sentence, and
>>>>>>>> attach the lemmas and compound-pieces to the Position subclass
>>>>>>>> objects.
>>>>>>>>
>>>>>>>> in afterPosition, as each token comes 'into focus', splat the lemma
>>>>>>>> from the Position into the char term attribute, and insert new tokens
>>>>>>>> as needed for the compound components.
>>>>>>>>
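The output side of the plan summarized above can be sketched without Lucene as follows. Everything here is hypothetical: `SentenceRewriteDemo`, the `lemmas` and `compounds` lookup maps, and the "term/increment" string encoding are stand-ins for the Position subclass and the term and position-increment attributes. The sketch assumes inserted compound parts are emitted with a position increment of 0 so that they share the position of the token they came from.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Toy illustration of lemma replacement plus compound-part insertion. */
class SentenceRewriteDemo {
    /** Encodes a token as "term/positionIncrement" for easy inspection. */
    static String tok(String term, int posInc) {
        return term + "/" + posInc;
    }

    static List<String> rewrite(List<String> sentence,
                                Map<String, String> lemmas,
                                Map<String, List<String>> compounds) {
        List<String> out = new ArrayList<>();
        for (String term : sentence) {
            // Token comes into focus: replace the surface form with its lemma, if any.
            out.add(tok(lemmas.getOrDefault(term, term), 1));
            // Analogue of afterPosition: insert compound parts at the same position.
            for (String part : compounds.getOrDefault(term, Collections.emptyList())) {
                out.add(tok(part, 0));
            }
        }
        return out;
    }
}
```

Emitting every fragment this way is the option that raises the precision concern mentioned above; suppressing very common fragments would amount to filtering the `compounds` lists before they are attached.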
>>>>>>>> Thanks,
>>>>>>>> benson
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Mike McCandless
>>>>>>>>>
>>>>>>>>> http://blog.mikemccandless.com
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>>>

