lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: PositionLengthAttribute
Date Sat, 07 Sep 2013 01:28:58 GMT
On Fri, Sep 6, 2013 at 8:03 PM, Benson Margulies <benson@basistech.com> wrote:
> I'm confused by the comment about compound components here.
>
> If a single token fissions into multiple tokens, then what belongs in
> the PositionLengthAttribute. I'm wanting to store a fraction in here!
> Or is the idea to store N in the 'mother' token and then '1' in each
> of the babies?

It's the latter. The way it's designed to work is, I think, best
illustrated by the Kuromoji analyzer, which heuristically decompounds nouns:

If it decompounds ABCD into AB + CD, then the tokens are AB and CD.
These both have posinc=1.
However (to compensate for the precision issue you mentioned on the other
thread), it keeps the full compound as a synonym too (there are some
papers benchmarking this approach for decompounding; just think of IDF
etc. sorting things out).
So the ABCD synonym has position increment 0, and it "sits" at the
same position as the first token (AB). But it has positionLength=2,
which keeps the information in the chain that this "synonym"
spans both AB and CD.

So the output looks like this: AB(posinc=1, posLength=1),
ABCD(posinc=0, posLength=2), CD(posinc=1, posLength=1)
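
To make the arithmetic concrete, here is a minimal plain-Java sketch of how those three tokens map to positions. It deliberately does not use Lucene: the `PositionLengthSketch` class, its `Token` holder, and the `spans` method are hypothetical names made up for illustration. It assumes the usual convention that a token's position is the running sum of position increments (starting from -1), and that the token ends at position + positionLength.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these types are Lucene classes.
class PositionLengthSketch {

    // Minimal stand-in for the attributes discussed above.
    static final class Token {
        final String term;
        final int posInc;    // position increment relative to the previous token
        final int posLength; // number of positions this token spans

        Token(String term, int posInc, int posLength) {
            this.term = term;
            this.posInc = posInc;
            this.posLength = posLength;
        }
    }

    // Returns "term[start->end]" per token, where start is the running sum of
    // position increments and end = start + posLength.
    static List<String> spans(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1; // assumed starting point, so the first posinc=1 lands on position 0
        for (Token t : tokens) {
            pos += t.posInc;
            out.add(t.term + "[" + pos + "->" + (pos + t.posLength) + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
                new Token("AB", 1, 1),
                new Token("ABCD", 0, 2),
                new Token("CD", 1, 1));
        // Prints: [AB[0->1], ABCD[0->2], CD[1->2]]
        System.out.println(spans(tokens));
    }
}
```

Under these assumptions ABCD starts where AB starts (position 0) and ends where CD ends (position 2), which is the sense in which the synonym "spans" both parts.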
