lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Elschot <>
Subject Re: Good representation for part-of-speech, chunk, sentence boundary tags?
Date Wed, 04 Jan 2006 14:05:35 GMT
On Wednesday 04 January 2006 14:14, Erik Hatcher wrote:
> On Jan 4, 2006, at 7:53 AM, Paul Elschot wrote:
> > On Wednesday 04 January 2006 07:34, Dave Kor wrote:
> >> Hi,
> >>
> >>
> >>   These additional information will enable Lucene to perform  
> >> additional
> >> post-processing on retrieved documents for various purposes such as
> >> information extraction, summarization, question answering, etc...  
> >> Is there
> >> any available api? If not, I would appreciate any suggestions and  
> >> tips on
> >> how such information can best be stored in a Lucene document.
> >
> > Basically, the index information available in Lucene is the Term,  
> > which is a
> > combination of a field name and a token. For these Lucene indexes
> > document presence and all positions within a document.  Lucene also
> > indexes the field length as a norm.
> > By using one ore more extra fields the tags and sentence boundary  
> > markers
> > can be easily indexed at their positions. To search these have a  
> > look at the
> > span package.
> > In case you want to search for tokens combined with some (part of  
> > speech)
> > tag, and the tokens and their tags are in different fields, the  
> > span package
> > is not sufficient, because it does not allow position search over  
> > different
> > fields.
> Paul - I'm interested in this topic myself.  Suppose the "text" field  
> is indexed but also entities are detected like names and places.   
> Suppose I'd like a query that was "all names that have the initials  
> EH in the text field" (where we could identify EH names by doing a  
> SpanRegexQuery for "E.* H.*".
> I've been pondering whether it makes sense for Lucene to be enhanced  
> to carry over a Token's type into the index such that it could factor  
> into the query also.
> Thoughts?

When the token type is a (small) integer, it could be stored in
addition to the indexed position as additional information in the
proximity index.
The document level  index could also be extended to
to be able to find the documents containing this typed token.

Part of speech tags would map nicely to this token type.
In the (recent) past there has been talk of structured fields and
sentence/paragraph level fields, but I don't recall details of
intended implementations.

Sentence/paragraph level fields are probably better handled by
extra proximity indexes for the same field, ie. without repeating
the term index. For this the document level index would need
a pointer into each proximity index, currently it has only one
pointer into the only proximity index.
When there are only a few types/tags, each of them could be
mapped to a level like this.

So there are (at least) two independent possibilities:
- store type/tag info in the proximity index, and maybe allow
  searching this on document level.
- add more proximity indexes for the same field.

Paul Elschot

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message