lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: LookaheadTokenFilter
Date Sat, 07 Sep 2013 20:21:43 GMT
That would be awesome, thanks!

Mike McCandless

http://blog.mikemccandless.com


On Sat, Sep 7, 2013 at 3:40 PM, Benson Margulies <benson@basistech.com> wrote:
> I think I had better build you a test case for this situation, and
> attach it to a JIRA.
>
> On Sat, Sep 7, 2013 at 3:33 PM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
>> Something is wrong; I'm not sure what offhand, but calling peekToken
>> 10 times should not stack all tokens @ position 0; it should stack the
>> tokens at the positions where they occurred.  Are you sure the posIncr
>> att is sometimes 1 (i.e., the position is in fact moving forward for
>> some tokens)?
>>
>> nextToken() only calls peekToken() once the lookahead buffer is exhausted.
>>
>> afterPosition() should be called within nextToken(), for each
>> position, once all tokens leaving that position are done.
>>
>> Your use case *should* be working: inside your incrementToken() you
>> call peekToken() over and over until you've seen the full sentence
>> (saving away any state in your subclass of Position), then nextToken()
>> to emit the buffered tokens, and to insert your own tokens when
>> afterPosition() is called ...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Sat, Sep 7, 2013 at 1:10 PM, Benson Margulies <benson@basistech.com> wrote:
>>> nextToken() calls peekToken(). That seems to prevent my lookahead
>>> processing from seeing that item later. Am I missing something?
>>>
>>>
>>> On Fri, Sep 6, 2013 at 9:15 PM, Benson Margulies <benson@basistech.com> wrote:
>>>> I think that the penny just dropped, and I should not be using this class.
>>>>
>>>> If I call peekToken 10 times while sitting at token 0, this class will
>>>> stack up all 10 of these _at token position 0_. That's not really very
>>>> helpful for what I'm doing. I need to borrow code from this class and
>>>> not use it.
>>>>
>>>> On Fri, Sep 6, 2013 at 9:10 PM, Benson Margulies <benson@basistech.com> wrote:
>>>>> Michael,
>>>>>
>>>>> I'm apparently not fully deconfused yet.
>>>>>
>>>>> I've got a very simple incrementToken function. It calls peekToken to
>>>>> stack up the tokens.
>>>>>
>>>>> afterPosition is never called; I expected it to be called as each of
>>>>> the peeked tokens gets next-ed back out.
>>>>>
>>>>> I assume that I'm missing something simple.
>>>>>
>>>>>
>>>>>     public boolean incrementToken() throws IOException {
>>>>>         if (positions.getMaxPos() < 0) {
>>>>>             peekSentence();
>>>>>         }
>>>>>         return nextToken();
>>>>>     }
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies <benson@basistech.com> wrote:
>>>>>> On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless
>>>>>> <lucene@mikemccandless.com> wrote:
>>>>>>>
>>>>>>> On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies <benson@basistech.com> wrote:
>>>>>>> > I'm trying to work through the logic of reading ahead until I've seen a
>>>>>>> > marker for the end of a sentence, then applying some analysis to all of
>>>>>>> > the tokens of the sentence, and then changing some attributes of each
>>>>>>> > token to reflect the results.
>>>>>>> >
>>>>>>> > The queue of tokens for a position is just a State, so there isn't an API
>>>>>>> > there to set any values.
>>>>>>> >
>>>>>>> > So do I need to subclass Position for myself, store the additional
>>>>>>> > information in there, and set the attributes as each token comes by on
>>>>>>> > the output side?
>>>>>>>
>>>>>>> Yes, that sounds right.  Either that or, on emitting the eventual
>>>>>>> Tokens, apply your logic there (because at that point, after
>>>>>>> restoreState, you have access to all the attr values for that token).
>>>>>>>
>>>>>>> > I would be grateful for a bit more explanation of afterPosition versus
>>>>>>> > incrementToken; some of the mock classes call peek from afterPosition,
>>>>>>> > and I expected to see peek called in incrementToken based on the javadoc.
>>>>>>>
>>>>>>> afterPosition is where your subclass can "insert" new tokens.
>>>>>>>
>>>>>>> I think (it's been a while here...) you are allowed to call peekToken
>>>>>>> in afterPosition; this is necessary if your logic about inserting
>>>>>>> additional tokens leaving a given position depends on future tokens.
>>>>>>>
>>>>>>> But: are you doing any new token insertion?  Or are you just tweaking
>>>>>>> the attributes of the tokens that pass through the filter?  If it's
>>>>>>> the latter then this class may be overkill ... you could make a simple
>>>>>>> TokenFilter.incrementToken that just enumerates & saves all input
>>>>>>> tokens, does its processing, then returns those tokens one by one,
>>>>>>> instead.
>>>>>>
>>>>>> I'm not adding tokens yet, but I will be soon, so all of this isn't
>>>>>> entirely crazy. The underlying capability here includes decompounding.
>>>>>> (I have mixed feelings about just adding all the fragments to the
>>>>>> token stream, as it can reduce precision, but there isn't an obvious
>>>>>> alternative (except perhaps to suppress the super-common ones)).
>>>>>>
>>>>>> So, to summarize, logic might be:
>>>>>>
>>>>>> in incrementToken:
>>>>>>
>>>>>> If positions.getMaxPos() > -1, just return nextToken(). If not, loop
>>>>>> calling peekToken to acquire a sentence, process the sentence, and
>>>>>> attach the lemmas and compound-pieces to the Position subclass
>>>>>> objects.
>>>>>>
>>>>>> in afterPosition, as each token comes 'into focus', splat the lemma
>>>>>> from the Position into the char term attribute, and insert new tokens
>>>>>> as needed for the compound components.
>>>>>>
>>>>>> Thanks,
>>>>>> benson
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Mike McCandless
>>>>>>>
>>>>>>> http://blog.mikemccandless.com
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>
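
The simpler alternative suggested in the thread (save all input tokens, do the processing, then return the tokens one by one) can be sketched without any Lucene classes. The following is only a plain-Java model of that buffer-and-replay pattern, not Lucene's API: the class name SentenceBuffer, the endMarker parameter, and the analysis callback are all hypothetical names introduced for illustration. A real TokenFilter would presumably save and restore attribute state rather than plain strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.UnaryOperator;

// Hypothetical model: tokens are plain strings, and a sentence ends at endMarker.
class SentenceBuffer implements Iterator<String> {
    private final Iterator<String> input;
    private final String endMarker;
    private final UnaryOperator<List<String>> analysis;
    // Tokens of the current sentence that have been processed but not yet emitted.
    private final Deque<String> pending = new ArrayDeque<>();

    SentenceBuffer(Iterator<String> input, String endMarker,
                   UnaryOperator<List<String>> analysis) {
        this.input = input;
        this.endMarker = endMarker;
        this.analysis = analysis;
    }

    // Read ahead one whole sentence, run the analysis on it, and queue the result.
    // Loops so that a sentence whose analysis yields nothing does not end the stream.
    private void fill() {
        while (pending.isEmpty() && input.hasNext()) {
            List<String> sentence = new ArrayList<>();
            while (input.hasNext()) {
                String token = input.next();
                sentence.add(token);
                if (token.equals(endMarker)) {
                    break;
                }
            }
            pending.addAll(analysis.apply(sentence));
        }
    }

    @Override
    public boolean hasNext() {
        fill();
        return !pending.isEmpty();
    }

    @Override
    public String next() {
        fill();
        if (pending.isEmpty()) {
            throw new NoSuchElementException();
        }
        return pending.removeFirst();
    }
}
```

The consumer only ever sees one token per call, while the analysis callback always receives a complete sentence (or the trailing tokens if the input ends without a marker). This pattern covers attribute tweaking; inserting extra tokens at specific positions is the case the thread says LookaheadTokenFilter is meant for.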

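
The diagnostic question raised in the thread (is the position increment ever 1?) can be illustrated with a small model of how peeked tokens would be grouped by position. This is a sketch under the assumption that a token's position is the running sum of position increments, starting from -1; PositionStacker and Tok are hypothetical names, and nothing here is taken from the LookaheadTokenFilter source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of stacking peeked tokens at the positions where they occurred.
class PositionStacker {
    // A token reduced to its term text and its position increment.
    static final class Tok {
        final String term;
        final int posIncr;

        Tok(String term, int posIncr) {
            this.term = term;
            this.posIncr = posIncr;
        }
    }

    // Groups terms by absolute position, where position = running sum of increments.
    static Map<Integer, List<String>> stack(List<Tok> peeked) {
        Map<Integer, List<String>> byPosition = new TreeMap<>();
        int position = -1; // assumed start, so a first increment of 1 lands on position 0
        for (Tok tok : peeked) {
            position += tok.posIncr;
            byPosition.computeIfAbsent(position, k -> new ArrayList<>()).add(tok.term);
        }
        return byPosition;
    }
}
```

In this model, a stream whose increments are 1, 1, 0, 1 spreads over positions 0, 1, 1, 2, while a stream whose increments after the first token are all 0 piles every token onto position 0, which matches the symptom described in the thread.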

