lucene-dev mailing list archives

From "David Smiley (JIRA)" <>
Subject [jira] [Updated] (LUCENE-6445) Highlighter TokenSources simplification; just one getAnyTokenStream()
Date Fri, 24 Apr 2015 20:35:38 GMT


David Smiley updated LUCENE-6445:
    Attachment: LUCENE-6445_TokenSources_simplification.patch

Attached patch.
The 2nd method name is actually "getTermVectorTokenStreamOrNull", and I decided that positions
on the term vector needn't be a hard requirement.  

The patch adds a test for the maxStartOffset behavior. The javadocs for these two methods
are quite complete, including a warning about multi-valued fields.  Solr calls one of these
now with the maxStartOffset, so it will benefit.  Updating all the test calls was a bit tedious.

Also, this highlighter module now depends on analysis-common for the LimitTokenOffsetFilter.
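
For readers skimming the thread, the shape of the two entry points can be sketched roughly as
follows. This is a self-contained illustration and not the patch itself: Token,
termVectorTokensOrNull, anyTokens and the whitespace "analyzer" are hypothetical stand-ins, whereas
the real methods work on Lucene's Fields and TokenStream types and do the limiting with
LimitTokenOffsetFilter. Treating a negative maxStartOffset as "no limit" is an assumption of this
sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a token with character offsets (not a Lucene type).
final class Token {
    final String term;
    final int startOffset;
    final int endOffset;

    Token(String term, int startOffset, int endOffset) {
        this.term = term;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

final class TokenSourcesSketch {
    // Returns the tokens stored in term vectors for the field, or null if there are none.
    static List<Token> termVectorTokensOrNull(
            String field, Map<String, List<Token>> termVectors, int maxStartOffset) {
        List<Token> stored = termVectors.get(field);
        return stored == null ? null : limit(stored, maxStartOffset);
    }

    // Prefers term vectors; falls back to analyzing the text. Never returns null.
    static List<Token> anyTokens(
            String field, Map<String, List<Token>> termVectors, String text, int maxStartOffset) {
        List<Token> fromVectors = termVectorTokensOrNull(field, termVectors, maxStartOffset);
        return fromVectors != null ? fromVectors : limit(analyze(text), maxStartOffset);
    }

    // Toy analyzer: splits on whitespace and records character offsets.
    static List<Token> analyze(String text) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) out.add(new Token(text.substring(start, i), start, i));
        }
        return out;
    }

    // Drops tokens starting after maxStartOffset; negative means no limit (assumption).
    static List<Token> limit(List<Token> tokens, int maxStartOffset) {
        if (maxStartOffset < 0) return tokens;
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            if (t.startOffset <= maxStartOffset) out.add(t);
        }
        return out;
    }
}
```

With no term vectors for a field, anyTokens falls back to the analyzed text, so a caller only has
to deal with null when it explicitly asks for the term-vector-only variant.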

> Highlighter TokenSources simplification; just one getAnyTokenStream()
> ---------------------------------------------------------------------
>                 Key: LUCENE-6445
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>         Attachments: LUCENE-6445_TokenSources_simplification.patch
> The Highlighter "TokenSources" class has quite a few utility methods pertaining to getting
> a TokenStream from either term vectors or analyzed text.  I think it's too much:
> * Some go to term vectors, some don't.  But if you don't want to go to term vectors,
>   then it's quite easy for the caller to invoke the Analyzer for the field value, and to get
>   that field value.
> * Some methods return null, some never null; I forget which at a glance.
> * Some methods read the Document (to get a field value) from the IndexReader, some don't.
>   Furthermore, it's not an ideal place to get the doc since your app might be using an
>   IndexSearcher with a document cache (e.g. SolrIndexSearcher).
> * None of the methods accept a Fields instance from term vectors as a parameter.  Based
>   on how Lucene's term vector format works, this is a performance trap if you don't re-use an
>   instance across fields on the document that you're highlighting.
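
The last point in the description above, re-using one term vector instance across fields, can be
illustrated with a toy model. Everything here is hypothetical (CountingVectorReader,
loadTermVectors, ReuseSketch); it only counts how often a document's term vectors are loaded and
is not Lucene's API. The assumption is that one load returns the vectors for all fields of a
document, so loading per field repeats the same work.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader that counts how often a document's term vectors are loaded.
final class CountingVectorReader {
    int loads = 0;
    private final Map<String, List<String>> stored;

    CountingVectorReader(Map<String, List<String>> stored) {
        this.stored = stored;
    }

    // Loads the term vectors of all fields for one document in a single, costly call.
    // The toy model holds a single document, so docId is ignored.
    Map<String, List<String>> loadTermVectors(int docId) {
        loads++;
        return new HashMap<>(stored);
    }
}

final class ReuseSketch {
    // The trap: every field triggers its own load of the whole document's vectors.
    static int termCountPerFieldLoad(CountingVectorReader reader, int docId, List<String> fields) {
        int total = 0;
        for (String field : fields) {
            Map<String, List<String>> vectors = reader.loadTermVectors(docId);
            total += vectors.getOrDefault(field, Collections.emptyList()).size();
        }
        return total;
    }

    // Preferred: load once per document and pass the same instance to every field.
    static int termCountSharedLoad(CountingVectorReader reader, int docId, List<String> fields) {
        Map<String, List<String>> vectors = reader.loadTermVectors(docId);
        int total = 0;
        for (String field : fields) {
            total += vectors.getOrDefault(field, Collections.emptyList()).size();
        }
        return total;
    }
}
```

Both variants see the same terms; only the number of loads differs, which is why accepting the
term vector instance as a parameter lets the caller pay that cost once per document.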
