lucene-java-user mailing list archives

From Grijesh <>
Subject Re: Inquiring part-of-speech (POS) tagging indexing and searching
Date Tue, 03 May 2011 06:02:31 GMT
As you have seen the example code for PartOfSpeechTaggingFilter at

You can use a custom analyzer to inject "metadata" tokens into the index at
the same position as the source tokens.

For example, given the text:
    The     cat     jumped     over     the     dog
your analyzer could emit tokens:
    [the]     [cat,_posNoun] [jumped,_posVerb]     [over]     [the]     [dog,_posNoun]

where the "_pos..." tokens have a zero position increment, which effectively
associates them with the term they describe (this is how the example
SynonymTokenizer in the highlighter package works). The "_pos" prefix marks
metadata tokens as distinct, so they cannot clash with real content tokens.
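
To make the zero-increment idea concrete, here is a rough sketch in plain Java.
It only models a token stream and is not the Lucene TokenStream API: the class
PosTokenSketch, the Token type and the hard-coded TAGS lookup (standing in for
a real tagger) are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java model of a token stream; NOT the Lucene TokenStream API.
class PosTokenSketch {

    // One emitted token: its text and its increment relative to the previous token.
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Hypothetical lookup standing in for a real part-of-speech tagger.
    static final Map<String, String> TAGS = new HashMap<>();
    static {
        TAGS.put("cat", "_posNoun");
        TAGS.put("dog", "_posNoun");
        TAGS.put("jumped", "_posVerb");
    }

    // Emit each word with increment 1, then its tag (if any) with increment 0.
    static List<Token> analyze(String text) {
        List<Token> out = new ArrayList<>();
        for (String word : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            out.add(new Token(word, 1));
            String tag = TAGS.get(word);
            if (tag != null) {
                out.add(new Token(tag, 0)); // lands on the same position as the word
            }
        }
        return out;
    }

    // Resolve increments into absolute positions, the way an index would store them.
    static List<String> withPositions(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int position = -1;
        for (Token t : tokens) {
            position += t.positionIncrement;
            out.add(position + ":" + t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(withPositions(analyze("The cat jumped over the dog")));
    }
}
```

In a real analyzer the same effect would come from setting the position
increment to 0 on the injected token inside a custom filter.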

In theory you could then construct queries that mix content terms with your
part-of-speech metadata. For example, you could use position-based queries to
find out what things normally have a particular verb applied to them:
    "jumped _posNoun"~3
or what verbs are commonly associated with a dog (caution advised here):
    "_posVerb the dog"~3
or to match an ambiguous word in a particular context/sense:
    "_posVerb track"~1
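
As a rough illustration of why such proximity queries can work, the sketch
below takes the token positions from the cat/dog example and looks for tagged
words near an anchor term. It is plain Java, not Lucene: PosProximitySketch and
taggedNear are hypothetical names, and the symmetric distance check is a
simplification, not Lucene's actual sloppy-phrase matching. Note also that a
phrase query only returns matching documents; recovering the co-located word,
as done here by hand, would be extra work.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of position-based matching; NOT Lucene's PhraseQuery logic.
class PosProximitySketch {

    // Tokens for "The cat jumped over the dog" with their absolute positions.
    static final String[] TERMS =
            {"the", "cat", "_posNoun", "jumped", "_posVerb", "over", "the", "dog", "_posNoun"};
    static final int[] POSITIONS = {0, 1, 1, 2, 2, 3, 4, 5, 5};

    // Content words that share a position with `tag` and lie within
    // `maxDistance` positions of an occurrence of `anchor`.
    static List<String> taggedNear(String anchor, String tag, int maxDistance) {
        List<String> hits = new ArrayList<>();
        for (int a = 0; a < TERMS.length; a++) {
            if (!TERMS[a].equals(anchor)) continue;
            for (int t = 0; t < TERMS.length; t++) {
                if (!TERMS[t].equals(tag)) continue;
                if (Math.abs(POSITIONS[t] - POSITIONS[a]) > maxDistance) continue;
                // Collect the real (non-metadata) words stacked at the tag's position.
                for (int w = 0; w < TERMS.length; w++) {
                    if (POSITIONS[w] == POSITIONS[t] && !TERMS[w].startsWith("_pos")) {
                        hits.add(TERMS[w]);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(taggedNear("jumped", "_posNoun", 3)); // nouns near "jumped"
        System.out.println(taggedNear("dog", "_posVerb", 3));    // verbs near "dog"
    }
}
```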
