lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: Incorrect Token Offset when using multiple fieldable instance
Date Wed, 02 Jul 2008 12:32:47 GMT

This would actually be a fairly large change: it's a change to the  
index format and all APIs that handle offsets during indexing &  

We could alternatively extend TokenStream so you could query it for  
the final offset, then fix indexing to use that value instead of the  
endOffset of the last token that it saw.


Toph wrote:

> Interesting discussion... glad I'm not the only one with this  
> challenge.
> Michael McCandless-2 wrote:
>> EG, if you use Highlighter on a
>> multi-valued field indexed with stored field & term vectors and say
>> the first field ended with a stop word that was filtered out, then
>> your offsets will be off and the wrong parts will be highlighted
> I found this post by attempting just this exact thing, and I can  
> confirm,
> that yes, the offsets are incorrect for all but the first instance  
> of the
> field in the document, so they are useless for highlighting.  I tried
> concatenating all instances of the fields, but of course if an  
> instance of
> the field ended with punctuation or a stop word, those characters  
> were not
> added to the offset.  I'll try the suggested workaround re adding a  
> false
> term at the end of each field, but a better API would be if "offset"  
> became
> a pair of ints, first being the index of the Field for  
> getFields(name) and
> the second being the offset in that instance of the field.
> Christopher
> -- 
> View this message in context:
> Sent from the Lucene - Java Users mailing list archive at
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message