lucene-java-user mailing list archives

From Benson Margulies <ben...@basistech.com>
Subject Re: LookaheadTokenFilter
Date Sat, 07 Sep 2013 01:10:33 GMT
Michael,

I'm apparently not fully deconfused yet.

I've got a very simple incrementToken function. It calls peekToken to
stack up the tokens.

afterPosition is never called; I expected it to be called as each of
the peeked tokens gets next-ed back out.

I assume that I'm missing something simple.


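    @Override
    public boolean incrementToken() throws IOException {
        // Nothing buffered yet: read ahead (via peekToken) to the end of the sentence.
        if (positions.getMaxPos() < 0) {
            peekSentence();
        }
        // Hand the buffered tokens back out one at a time.
        return nextToken();
    }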



On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies <benson@basistech.com> wrote:
> On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
>>
>> On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies <benson@basistech.com> wrote:
>> > I'm trying to work through the logic of reading ahead until I've seen
>> > a marker for the end of a sentence, then applying some analysis to all of the
>> > tokens of the sentence, and then changing some attributes of each token to
>> > reflect the results.
>> >
>> > The queue of tokens for a position is just a State, so there isn't an API
>> > there to set any values.
>> >
>> > So do I need to subclass Position for myself, store the additional
>> > information in there, and set the attributes as each token comes by on the
>> > output side?
>>
>> Yes, that sounds right.  Either that or, on emitting the eventual
>> Tokens, apply your logic there (because at that point, after
>> restoreState, you have access to all the attr values for that token).
>>
>> > I would be grateful for a bit more explanation of afterPosition versus
>> > incrementToken; some of the mock classes call peek from afterPosition, and
>> > I expected to see peek called in incrementToken based on the javadoc.
>>
>> afterPosition is where your subclass can "insert" new tokens.
>>
>> I think (it's been a while here...) you are allowed to call peekToken
>> in afterPosition; this is necessary if your logic about inserting
>> additional tokens leaving a given position depends on future tokens.
>>
>> But: are you doing any new token insertion?  Or are you just tweaking
>> the attributes of the tokens that pass through the filter?  If it's
>> the latter then this class may be overkill ... you could make a simple
>> TokenFilter.incrementToken that just enumerates & saves all input
>> tokens, does its processing, then returns those tokens one by one,
>> instead.
>
> I'm not adding tokens yet, but I will be soon, so all of this isn't
> entirely crazy. The underlying capability here includes decompounding.
> (I have mixed feelings about just adding all the fragments to the
> token stream, as it can reduce precision, but there isn't an obvious
> alternative (except perhaps to suppress the super-common ones)).
>
> So, to summarize, logic might be:
>
> in incrementToken:
>
> If positions.getMaxPos() > -1, just return nextToken(). If not, loop
> calling peekToken to acquire a sentence, process the sentence, and
> attach the lemmas and compound-pieces to the Position subclass
> objects.
>
> in afterPosition, as each token comes 'into focus', splat the lemma
> from the Position into the char term attribute, and insert new tokens
> as needed for the compound components.
>
> Thanks,
> benson
>
>
>
>
>
>>
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
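
The simpler approach discussed in the quoted thread (save all the input tokens
up to the end of a sentence, process them together, then return them one by
one) can be sketched in plain Java. This is only an illustration under stated
assumptions: it deliberately does not use Lucene, and every name in it
(SentenceBufferSketch, Tok, fillSentence, the "." sentence marker, lowercasing
as the "analysis") is hypothetical. In a real TokenFilter the buffered items
would presumably be captured attribute States rather than strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

/** Plain-Java stand-in for a sentence-buffering filter; none of this is Lucene API. */
class SentenceBufferSketch {

    /** Hypothetical token: the surface term plus a slot for the analysis result. */
    static final class Tok {
        final String term;
        String lemma; // filled in only after the whole sentence has been read
        Tok(String term) { this.term = term; }
    }

    private final Iterator<String> input;
    private final Deque<Tok> pending = new ArrayDeque<>();

    SentenceBufferSketch(Iterator<String> input) {
        this.input = input;
    }

    /** Returns the next token, or null once the input is exhausted. */
    Tok next() {
        if (pending.isEmpty()) {
            fillSentence();
        }
        return pending.poll(); // poll() yields null when nothing is buffered
    }

    /** Reads ahead to the (assumed) sentence marker, then analyzes the sentence as a unit. */
    private void fillSentence() {
        List<Tok> sentence = new ArrayList<>();
        while (input.hasNext()) {
            Tok t = new Tok(input.next());
            sentence.add(t);
            if (t.term.equals(".")) {
                break; // assumption: "." marks the end of a sentence
            }
        }
        for (Tok t : sentence) {
            // Placeholder for real per-sentence analysis (lemmas, compounds, ...).
            t.lemma = t.term.toLowerCase(Locale.ROOT);
            pending.add(t);
        }
    }

    public static void main(String[] args) {
        Iterator<String> in = Arrays.asList("Dogs", "Bark", ".", "Cats", "Purr", ".").iterator();
        SentenceBufferSketch filter = new SentenceBufferSketch(in);
        for (Tok t = filter.next(); t != null; t = filter.next()) {
            System.out.println(t.term + " -> " + t.lemma);
        }
    }
}
```

The point of the sketch is the ordering: no token is handed out until its whole
sentence has been read and processed, so each token already carries its result
when it is emitted. It does not insert new tokens, which is the case where the
thread suggests LookaheadTokenFilter and afterPosition become relevant.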


