lucene-java-user mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Good representation for part-of-speech, chunk, sentence boundary tags?
Date Wed, 04 Jan 2006 12:53:51 GMT
On Wednesday 04 January 2006 07:34, Dave Kor wrote:
> Hi,
> 
>   I would like to associate information (or labels) with each word or a
> range of words in a document. Information such as this word is a noun, that
> word is a verb, this period marks the end of a sentence, "kick the bucket"
> is a contiguous phrase, "white house" is a location and so on. I am seeking
> a good representation for such information so that they can be easily stored
> as additional fields in a lucene document, and easily recovered after a
> search. For the more technically inclined, this would allow me to store
> part-of-speech tags, chunk tags, sentence boundary markers and parse trees
> for every indexed document.
> 
>   These additional information will enable Lucene to perform additional
> post-processing on retrieved documents for various purposes such as
> information extraction, summarization, question answering, etc... Is there
> any available api? If not, I would appreciate any suggestions and tips on
> how such information can best be stored in a Lucene document.

Basically, the unit of index information in Lucene is the Term, which is a
combination of a field name and a token. For each term, Lucene indexes
the documents that contain it and all of its positions within each
document. Lucene also indexes the field length as a norm.
By using one or more extra fields, the tags and sentence boundary markers
can easily be indexed at their positions. To search these, have a look at
the span package.
In case you want to search for tokens combined with some (part of speech)
tag, and the tokens and their tags are in different fields, the span package
is not sufficient, because it does not allow position search over different
fields.
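
As a rough illustration of the two points above, here is a small sketch in
plain Java. It is not Lucene API: the class TaggedPositions and the field
names "text" and "pos" are made up for this example, and it models a single
document only. Tokens and their tags are recorded at the same positions in two
fields, and a token-with-tag match is done by intersecting the two position
lists in application code, which is the cross-field step the span package does
not do for you.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model, not Lucene API: a term is a (field, token)
// pair, and each term maps to the positions where it occurs in one document.
class TaggedPositions {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Record each token and its tag at the same position, in two fields.
    void add(String[] tokens, String[] tags) {
        for (int pos = 0; pos < tokens.length; pos++) {
            record("text", tokens[pos], pos);
            record("pos", tags[pos], pos);
        }
    }

    private void record(String field, String token, int pos) {
        postings.computeIfAbsent(field + ":" + token, k -> new ArrayList<>()).add(pos);
    }

    // All positions of the given term, or an empty list if it never occurs.
    List<Integer> positions(String field, String token) {
        return postings.getOrDefault(field + ":" + token, new ArrayList<>());
    }

    // Positions where the token in "text" carries the given tag in "pos".
    List<Integer> tokenWithTag(String token, String tag) {
        List<Integer> result = new ArrayList<>(positions("text", token));
        result.retainAll(positions("pos", tag));
        return result;
    }
}
```

For the tokens "they house people in the house" with tags PRP VB NNS IN DT NN,
tokenWithTag("house", "NN") finds only the last position, while
tokenWithTag("house", "VB") finds only the second one.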
Another way to use positions to mark sentence boundaries is to leave
gaps in the positions at the sentence boundaries. Queries then cannot
match across a boundary, as long as the slop (allowed distance) in the
queries is always smaller than this gap.
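
A minimal sketch of the gap idea, again in plain Java rather than Lucene API.
The class SentenceGapPositions, the method names and the gap of 100 are
assumptions for illustration, and within() is only a simplified stand-in for
a proximity check that counts the positions lying between two tokens.

```java
// Hypothetical helper, not Lucene API: assigns token positions and leaves
// `gap` unused positions after every sentence.
class SentenceGapPositions {
    // Returns the positions of all tokens, grouped per sentence.
    static int[][] assign(String[][] sentences, int gap) {
        int[][] positions = new int[sentences.length][];
        int next = 0;
        for (int s = 0; s < sentences.length; s++) {
            positions[s] = new int[sentences[s].length];
            for (int t = 0; t < sentences[s].length; t++) {
                positions[s][t] = next++;
            }
            next += gap; // skip positions so the next sentence starts far away
        }
        return positions;
    }

    // Simplified proximity check: the number of positions between the two
    // tokens must not exceed the slop. Adjacent tokens have zero in between.
    static boolean within(int posA, int posB, int slop) {
        return Math.abs(posA - posB) - 1 <= slop;
    }
}
```

With a gap of 100, the last token of a three-token first sentence gets
position 2 and the first token of the next sentence gets position 103, so 100
positions lie between them. Any slop below 100 cannot bridge the boundary,
while a slop of 100 would.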

Parse trees carry (much) more information, and these will not be easy to map
to Lucene, but it all depends on the searches you want to support.

Regards,
Paul Elschot
