lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: PositionLengthAttribute
Date Sat, 07 Sep 2013 01:37:37 GMT
On Fri, Sep 6, 2013 at 9:32 PM, Benson Margulies <benson@basistech.com> wrote:
> On Fri, Sep 6, 2013 at 9:28 PM, Robert Muir <rcmuir@gmail.com> wrote:
>> It's the latter. The way it's designed to work is, I think, best
>> illustrated by the kuromoji analyzer, where it heuristically
>> decompounds nouns:
>>
>> If it decompounds ABCD into AB + CD, then the tokens are AB and CD.
>> These both have posinc=1.
>> However (to compensate for the precision issue you mentioned on the
>> other thread), it keeps the full compound as a synonym too (there are
>> some papers benchmarking this approach for decompounding; just think
>> of IDF etc. sorting things out).
>> So that ABCD synonym has position increment 0, and it "sits" at the
>> same position as the first token (AB). But it has positionLength=2,
>> which basically keeps the information in the chain that this "synonym"
>> spans across both AB and CD.
>>
>> So the output is like this: AB(posinc=1,posLength=1),
>> ABCD(posinc=0,posLength=2), CD(posinc=1,posLength=1)
>
> I suppose this works best if you actually know the offsets of the
> pieces. In disassembling German, this is not always straightforward.
>

I don't really see how it has anything to do with natural languages.
It's just the way you represent the compound components in the
token stream.
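
To make the position arithmetic concrete, here is a minimal sketch in plain
Java. It does not use Lucene itself: the Token class and the arcs() helper
are hypothetical stand-ins for a token carrying the values that Lucene
exposes through PositionIncrementAttribute and PositionLengthAttribute. It
assumes the usual convention that the position counter starts at -1, so the
first token with posinc=1 lands on position 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenGraphSketch {

    /** Hypothetical stand-in for a token with its two position values. */
    static final class Token {
        final String term;
        final int posInc;    // increment relative to the previous token
        final int posLength; // number of positions this token spans

        Token(String term, int posInc, int posLength) {
            this.term = term;
            this.posInc = posInc;
            this.posLength = posLength;
        }
    }

    /** Renders each token as "term[from->to]", i.e. an arc in the token graph. */
    static List<String> arcs(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1; // assumed starting value, before the first token
        for (Token t : tokens) {
            pos += t.posInc;              // posinc=0 keeps the same start position
            int end = pos + t.posLength;  // posLength says where the arc ends
            out.add(t.term + "[" + pos + "->" + end + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("AB", 1, 1),
            new Token("ABCD", 0, 2),
            new Token("CD", 1, 1));
        // prints [AB[0->1], ABCD[0->2], CD[1->2]]
        System.out.println(arcs(tokens));
    }
}
```

The ABCD arc starts where AB starts and ends where CD ends, which is all
that positionLength=2 records. Nothing in the calculation depends on
character offsets or on the language being analyzed.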
