lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dave Kor <>
Subject Good representation for part-of-speech, chunk, sentence boundary tags?
Date Wed, 04 Jan 2006 06:34:12 GMT

  I would like to associate information (or labels) with each word or a
range of words in a document. Information such as this word is a noun, that
word is a verb, this period marks the end of a sentence, "kick the bucket"
is a contiguous phrase, "white house" is a location and so on. I am seeking
a good representation for such information so that they can be easily stored
as additional fields in a lucene document, and easily recovered after a
search. For the more technically inclined, this would allow me to store
part-of-speech tags, chunk tags, sentence boundary markers and parse trees
for every indexed document.

  These additional information will enable Lucene to perform additional
post-processing on retrieved documents for various purposes such as
information extraction, summarization, question answering, etc... Is there
any available api? If not, I would appreciate any suggestions and tips on
how such information can best be stored in a Lucene document.


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message