lucene-dev mailing list archives

From "Grant Ingersoll"
Subject Re: Dmitry's Term Vector stuff, plus some
Date Tue, 24 Feb 2004 22:24:40 GMT
It is the location of the token in the document (see IndexReader.termPositions()).  This information
is already being stored in other parts of the index; it just isn't very efficient to get at.

I think it would be useful to add to IndexReader a way to get a list of positions given
a term and a document; then we wouldn't have to store this info twice.  Something like:

TermPositions termPositions(Term term, Document doc);

which would return a subset of IndexReader.termPositions(Term term) containing only those
positions that are in the given Document.  This would need to be implemented in an efficient
manner, not just the brute-force method of looping over termPositions(Term term).  I don't know
how easy this would be to do, as I am not familiar with the file structure of the position information.
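
To make the brute-force versus efficient distinction concrete, here is a rough sketch using a toy
in-memory model.  It is not Lucene's API or file format: the TermPostings class, its fields, and
both method names are made up for illustration, and it assumes the postings for one term are held
as doc ids in ascending order with a parallel list of positions per document.

```java
import java.util.Arrays;

/**
 * Hypothetical in-memory postings for a single term (not a Lucene class).
 * docIds must be sorted ascending and unique; positions[i] holds the token
 * positions of the term inside document docIds[i].
 */
final class TermPostings {
    private final int[] docIds;
    private final int[][] positions;

    TermPostings(int[] docIds, int[][] positions) {
        if (docIds.length != positions.length) {
            throw new IllegalArgumentException("docIds and positions must have equal length");
        }
        this.docIds = docIds.clone();
        this.positions = positions.clone();
    }

    /** Brute force: walk every posting for the term until the document is found. */
    int[] positionsByScan(int docId) {
        for (int i = 0; i < docIds.length; i++) {
            if (docIds[i] == docId) {
                return positions[i].clone();
            }
        }
        return new int[0]; // term does not occur in this document
    }

    /** Faster lookup: binary search over the sorted doc ids, O(log n) per query. */
    int[] positionsBySearch(int docId) {
        int i = Arrays.binarySearch(docIds, docId);
        return i >= 0 ? positions[i].clone() : new int[0];
    }
}
```

Both methods return the same positions; the second just avoids touching every posting for the
term.  Whether the on-disk position data allows anything like this is exactly the open question.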

At least that is my understanding of it, perhaps others have more insight.


>>> 02/24/04 04:20PM >>>
Doug Cutting wrote:

> Grant Ingersoll wrote:
>> Do you see any reason to write position information at all for the 
>> term vectors?
> It could be useful to some folks.  If, for example, you only want to 
> expand a query with terms that occur near query terms, like automatic 
> phrase identification.  In general, the vector stuff is just a constant 
> factor improvement over re-tokenizing the text of the document, but 
> hopefully a substantial one.  If folks are doing computations which 
> require positional information, but don't require the actual text (e.g., 
> they don't need user-readable fragments) then positions could be handy.
> But, certainly, most applications for term vectors do not need 
> positions, and I would not be upset if these were left out of the first 
> version.

Forgive me for being thick, but what position information are we talking about here? The start
and end position of the token in the source text that the term came from? If so, I think it
would be useful to have them in at some point, as I believe they could be used to optimize the
query highlighting code that Mark Harwood contributed, so that it does not have to reanalyze the
text every time a highlighted search summary is generated.


Bruce Ritchie
