lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <>
Subject Re: Good representation for part-of-speech, chunk, sentence boundary tags?
Date Wed, 04 Jan 2006 14:05:12 GMT
We tend to store this information externally in XML (or in a stored 
field), but that is probably more of an artifact of our other uses of 
the information outside of Lucene than any shortcoming in Lucene.

A few of us on the list had some ideas on sentence boundary storage 
You could do similar things with some of the other information I 
suspect.  Don't know a lot about how it works, but I think you might be 
able to leverage the fact that you can store multiple tokens at the same 
position.  For instance, you may be able to add the POS of label at the 
same position as the term.  Perhaps Erik could add whether this works or not


Erik Hatcher wrote:

> On Jan 4, 2006, at 7:53 AM, Paul Elschot wrote:
>> On Wednesday 04 January 2006 07:34, Dave Kor wrote:
>>> Hi,
>>>   I would like to associate information (or labels) with each word  
>>> or a
>>> range of words in a document. Information such as this word is a  
>>> noun, that
>>> word is a verb, this period marks the end of a sentence, "kick the  
>>> bucket"
>>> is a contiguous phrase, "white house" is a location and so on. I  am 
>>> seeking
>>> a good representation for such information so that they can be  
>>> easily stored
>>> as additional fields in a lucene document, and easily recovered  
>>> after a
>>> search. For the more technically inclined, this would allow me to  
>>> store
>>> part-of-speech tags, chunk tags, sentence boundary markers and  
>>> parse trees
>>> for every indexed document.
>>>   These additional information will enable Lucene to perform  
>>> additional
>>> post-processing on retrieved documents for various purposes such as
>>> information extraction, summarization, question answering, etc...  
>>> Is there
>>> any available api? If not, I would appreciate any suggestions and  
>>> tips on
>>> how such information can best be stored in a Lucene document.
>> Basically, the index information available in Lucene is the Term,  
>> which is a
>> combination of a field name and a token. For these Lucene indexes
>> document presence and all positions within a document.  Lucene also
>> indexes the field length as a norm.
>> By using one ore more extra fields the tags and sentence boundary  
>> markers
>> can be easily indexed at their positions. To search these have a  
>> look at the
>> span package.
>> In case you want to search for tokens combined with some (part of  
>> speech)
>> tag, and the tokens and their tags are in different fields, the  span 
>> package
>> is not sufficient, because it does not allow position search over  
>> different
>> fields.
> Paul - I'm interested in this topic myself.  Suppose the "text" field  
> is indexed but also entities are detected like names and places.   
> Suppose I'd like a query that was "all names that have the initials  
> EH in the text field" (where we could identify EH names by doing a  
> SpanRegexQuery for "E.* H.*".
> I've been pondering whether it makes sense for Lucene to be enhanced  
> to carry over a Token's type into the index such that it could factor  
> into the query also.
> Thoughts?
>     Erik
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
337 Hinds Hall 
Syracuse, NY 13244 
Voice:  315-443-5484 
Fax: 315-443-6886 

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message