lucene-dev mailing list archives

From Christoph Goller
Subject Re: Term highlighting and Term vector patch
Date Thu, 16 Sep 2004 09:01:12 GMT
Grant Ingersoll wrote:
> Hi,
> I was browsing the term highlighting code in the sandbox and I noticed
> the following comment for the getBestFragment method in the
> code:
> 	/**
> ...
> 	 * @param tokenStream   a stream of tokens identified in the text
> 	 * parameter, including offset information. This is typically
> 	 * produced by an analyzer re-parsing a document's text. Some work
> 	 * may be done on retrieving TokenStreams more efficiently by adding
> 	 * support for storing original text position data in the Lucene
> 	 * index but this support is not currently available (as of
> 	 * Lucene 1.4 rc2).
> ...
> 	 */
> which struck me that I might be able to contribute some more time to
> make this so, since I recently submitted a patch to offer just such an
> enhancement to the term vector.
> I would like to implement this, but I don't really want to submit a
> patch against another patch (It's hard enough managing all the changes
> that come down).  So, I was wondering if anyone (i.e. a committer) has
> had a chance to look at the Term Vector offset patch and what their
> thoughts are on it?  I can see the performance improvements in the
> highlighter that would come about by avoiding having to re-analyze the
> text, plus you could highlight the whole field if you wanted to.

Hi Grant,

I will try to look at your latest code by the end of September, but I probably
won't find time before then. I am using the current TermVectors very
successfully. Thanks for the excellent code.

Your new patch provides the ability to store positions and token offsets,
doesn't it?

As far as I remember, there is also Bernhard's patch for making TermVectors
more efficient when multiple threads use one IndexReader, and there are the
API changes from Daniel that might influence your patch too. Are all of
these in sync?

> Also, if I make this change, do the committers suggest I keep the
> current ability to analyze and have this as an alternative, or would it
> be safe to assume this is only used when offset info is stored?

Storing the offsets will increase index size considerably, so one will not
always want to do that. I think highlighting should continue to work with
re-analysis. However, I know that this makes the code much more complex: you
always have to maintain two versions of the highlighter.
What do others think?
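
One way to limit the duplication might be to hide the source of the offsets
behind a small interface, so that the highlighting logic itself is written
only once. The following is a minimal, self-contained sketch of that idea and
not the sandbox highlighter's actual code: OffsetSource, Span, reanalyzing,
stored and highlight are hypothetical names invented for illustration, the
"analyzer" is a naive lowercasing tokenizer, and the stored offsets are a
plain Map standing in for whatever the term vector patch would provide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class OffsetSourceSketch {

    /** Character span [start, end) of one token in the original text. */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** Hypothetical abstraction: where the highlighter gets term offsets from. */
    interface OffsetSource {
        /** Returns the spans of the given term in ascending order. */
        List<Span> offsetsFor(String term);
    }

    /** Path 1: re-analyze the text (naive tokenizer: letter/digit runs, lowercased). */
    static OffsetSource reanalyzing(String text) {
        return term -> {
            List<Span> spans = new ArrayList<>();
            int i = 0;
            while (i < text.length()) {
                if (!Character.isLetterOrDigit(text.charAt(i))) { i++; continue; }
                int start = i;
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) i++;
                String token = text.substring(start, i).toLowerCase(Locale.ROOT);
                if (token.equals(term)) spans.add(new Span(start, i));
            }
            return spans;
        };
    }

    /** Path 2: offsets recorded at index time (assumption: a Map stands in for the index). */
    static OffsetSource stored(Map<String, List<Span>> storedOffsets) {
        return term -> storedOffsets.getOrDefault(term, List.of());
    }

    /** The single highlighting routine; it does not care which path produced the spans. */
    static String highlight(String text, String term, OffsetSource source) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Span s : source.offsetsFor(term)) {
            out.append(text, last, s.start)
               .append("<B>").append(text, s.start, s.end).append("</B>");
            last = s.end;
        }
        return out.append(text.substring(last)).toString();
    }

    public static void main(String[] args) {
        String text = "Term vectors store term offsets.";
        Map<String, List<Span>> index = Map.of("term", List.of(new Span(0, 4), new Span(19, 23)));

        // Both calls print: <B>Term</B> vectors store <B>term</B> offsets.
        System.out.println(highlight(text, "term", reanalyzing(text)));
        System.out.println(highlight(text, "term", stored(index)));
    }
}
```

With such a split, a field indexed without offsets would simply fall back to
the re-analyzing source, and only the two small OffsetSource implementations
would need to be maintained separately.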

