lucene-java-user mailing list archives

From Erik Hatcher
Subject Re: Good representation for part-of-speech, chunk, sentence boundary tags?
Date Wed, 04 Jan 2006 13:14:34 GMT

On Jan 4, 2006, at 7:53 AM, Paul Elschot wrote:

> On Wednesday 04 January 2006 07:34, Dave Kor wrote:
>> Hi,
>>   I would like to associate information (or labels) with each word or a
>> range of words in a document: for example, that this word is a noun, that
>> word is a verb, this period marks the end of a sentence, "kick the bucket"
>> is a contiguous phrase, "white house" is a location, and so on. I am seeking
>> a good representation for such information so that it can be easily stored
>> as additional fields in a Lucene document and easily recovered after a
>> search. For the more technically inclined, this would allow me to store
>> part-of-speech tags, chunk tags, sentence boundary markers, and parse trees
>> for every indexed document.
>>   This additional information will enable Lucene to perform additional
>> post-processing on retrieved documents for various purposes such as
>> information extraction, summarization, question answering, etc.
>> Is there any available API? If not, I would appreciate any suggestions and
>> tips on how such information can best be stored in a Lucene document.
> Basically, the index information available in Lucene is the Term, which is a
> combination of a field name and a token. For each term, Lucene indexes
> document presence and all positions within a document. Lucene also
> indexes the field length as a norm.
> By using one or more extra fields, the tags and sentence boundary markers
> can easily be indexed at their positions. To search these, have a look at the
> span package.
> In case you want to search for tokens combined with some (part-of-speech)
> tag, and the tokens and their tags are in different fields, the span package
> is not sufficient, because it does not allow position search over different
> fields.
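
To make the parallel-field idea concrete, here is a minimal plain-Java sketch of what "indexed at their positions" and "position search over different fields" mean. It deliberately uses no Lucene classes: ParallelFieldSketch, addField, positions, and samePosition are hypothetical names invented for illustration, and the sample tokens and tags are made up. It assumes the tokens and their tags are aligned one-to-one by position.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: this is not Lucene code or the Lucene API.
public class ParallelFieldSketch {
    // field name -> term -> positions at which the term occurs (single document)
    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    // Record each term at its array index, so position i in one field
    // lines up with position i in every other field.
    public void addField(String field, String[] terms) {
        Map<String, List<Integer>> byTerm =
                postings.computeIfAbsent(field, f -> new HashMap<>());
        for (int pos = 0; pos < terms.length; pos++) {
            byTerm.computeIfAbsent(terms[pos], t -> new ArrayList<>()).add(pos);
        }
    }

    // All positions of a term within one field (empty if absent).
    public List<Integer> positions(String field, String term) {
        Map<String, List<Integer>> byTerm =
                postings.getOrDefault(field, Collections.emptyMap());
        return byTerm.getOrDefault(term, Collections.emptyList());
    }

    // Cross-field position match: positions where termA occurs in fieldA
    // and termB occurs in fieldB.
    public List<Integer> samePosition(String fieldA, String termA,
                                      String fieldB, String termB) {
        List<Integer> result = new ArrayList<>(positions(fieldA, termA));
        result.retainAll(positions(fieldB, termB));
        return result;
    }

    public static void main(String[] args) {
        ParallelFieldSketch doc = new ParallelFieldSketch();
        doc.addField("text", new String[] {"the", "white", "house", "will", "house", "guests"});
        doc.addField("pos",  new String[] {"DT",  "JJ",    "NN",    "MD",   "VB",    "NNS"});
        // "house" used as a noun vs. as a verb
        System.out.println(doc.samePosition("text", "house", "pos", "NN"));
        System.out.println(doc.samePosition("text", "house", "pos", "VB"));
    }
}
```

The samePosition step is the part that needs positions from two different fields at once, which is the kind of query described above as not supported by the span package.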

Paul - I'm interested in this topic myself.  Suppose the "text" field
is indexed, but entities such as names and places are also detected.
Suppose I'd like a query for "all names that have the initials
EH in the text field" (where we could identify EH names by doing a
SpanRegexQuery for "E.* H.*").

I've been pondering whether it makes sense for Lucene to be enhanced
to carry a Token's type over into the index so that it could also
factor into the query.
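
As a rough sketch of what such a type-aware query might do, the plain-Java example below keeps a type next to each token and only accepts adjacent "E.* H.*" matches when both tokens are typed as names. Nothing here is Lucene API: TypedTokenSketch, TypedToken, adjacentMatches, and the type labels "name" and "word" are all hypothetical, and a real implementation would work against the index rather than an in-memory token list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only: this is not Lucene code or the Lucene API.
public class TypedTokenSketch {
    // A token's text together with the type assigned to it during analysis.
    static final class TypedToken {
        final String text;
        final String type;

        TypedToken(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    // Find adjacent token pairs where both tokens carry the required type
    // and their texts match the two regular expressions, in order.
    static List<String> adjacentMatches(List<TypedToken> tokens, String type,
                                        String firstRegex, String secondRegex) {
        Pattern first = Pattern.compile(firstRegex);
        Pattern second = Pattern.compile(secondRegex);
        List<String> hits = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            TypedToken a = tokens.get(i);
            TypedToken b = tokens.get(i + 1);
            if (a.type.equals(type) && b.type.equals(type)
                    && first.matcher(a.text).matches()
                    && second.matcher(b.text).matches()) {
                hits.add(a.text + " " + b.text);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<TypedToken> tokens = new ArrayList<>();
        tokens.add(new TypedToken("Erik", "name"));
        tokens.add(new TypedToken("Hatcher", "name"));
        tokens.add(new TypedToken("met", "word"));
        tokens.add(new TypedToken("Every", "word"));
        tokens.add(new TypedToken("Hour", "word"));
        // "Every Hour" matches the regexes but is not typed as a name.
        System.out.println(adjacentMatches(tokens, "name", "E.*", "H.*"));
    }
}
```

Without the type check, the regex pair alone would also return "Every Hour", which is why having the type available at query time matters for this kind of search.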


