lucene-java-user mailing list archives

From Michael McCandless <>
Subject Re: Phrase Highlighting
Date Thu, 21 May 2009 21:39:31 GMT
On Thu, May 21, 2009 at 3:09 PM, Max Lynch <> wrote:
> Sorry, the following code is in python, but I can hack a Java thing together
> if necessary.

I'm a big Python fan :)

> HighlighterSpanScorer is the SpanScorer from the highlight
> package, just renamed to avoid a conflict with the other SpanScorer object.
> What happens is that if I use a SpanScorer instead, and allocate it like
> so:
>            analyzer = StandardAnalyzer([])
>            tokenStream = analyzer.tokenStream("contents",
>                    lucene.StringReader(text))
>            ctokenStream = lucene.CachingTokenFilter(tokenStream)
>            highlighter = lucene.Highlighter(formatter,
>                    lucene.HighlighterSpanScorer(self.query, "contents", ctokenStream))
>            ctokenStream.reset()
>            result = highlighter.getBestFragments(ctokenStream, text,
>                    2, "...")
> My highlighter is still breaking up words inside of a span.  For example,
> if I search for "John Smith", instead of the highlighter being called for
> the whole "John Smith", it gets called for "John" and then "Smith".

I think you need to use SimpleSpanFragmenter (vs SimpleFragmenter,
which is the default used by Highlighter) to ensure that each fragment
contains a full match for the query.  For example, something like this
(copied from LIA, 2nd edition):

    TermQuery query = new TermQuery(new Term("field", "fox"));

    TokenStream tokenStream =
        new SimpleAnalyzer().tokenStream("field",
            new StringReader(text));

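    SpanScorer scorer = new SpanScorer(query, "field",
                                       new CachingTokenFilter(tokenStream));
    Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
    Highlighter highlighter = new Highlighter(scorer);
    // Replace the default SimpleFragmenter with the span-aware one.
    highlighter.setTextFragmenter(fragmenter);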

>> > In the meantime, if I am interested in finding out exactly how many
>> > times a term was found in a document, what is the best way to go about
>> > this?  The way I am doing it right now is using a highlighter and just
>> > incrementing counters when a word I'm interested in is found.  I just
>> > came across FieldSortedTermVectorMapper, which could do something
>> > similar.  Is FieldSortedTermVectorMapper something I could use for
>> > this?  Is there a better option?
>> Is it really just single terms you need to measure?  (e.g., not "how
>> many times did phrase XYZ occur in the doc").  If so, then getting the
>> term vectors and locating your term in there should work.  This is
>> probably OK if you just do it for each of the hits on the page (like
>> 10 hits), but will be way too slow if you try to do it for, say, all
>> docs that matched the query.
> I see how the term vector might be used.  I can't really tell if there is a
> way for me to do a Span check on the words as easily as the highlighter
> would do.

TermVectors won't let you do a span check -- they just return the
terms & their frequencies (and optionally positions & offsets, if you
indexed them).
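
The single-term lookup described above can be sketched in plain Java. This is a minimal illustration, not Lucene code: it models a term vector as two parallel arrays (terms, assumed here to be in sorted order, and their frequencies), which is the shape that TermFreqVector's getTerms() and getTermFrequencies() return in Lucene 2.x. The class name TermVectorCount and the method frequencyOf are hypothetical, as is the sample data.

```java
import java.util.Arrays;

public class TermVectorCount {
    /**
     * Returns how many times the given term occurs, given a term vector
     * modeled as parallel arrays. Assumes sortedTerms is in ascending
     * natural (String) order, so that binary search is valid.
     */
    public static int frequencyOf(String[] sortedTerms, int[] freqs, String term) {
        int idx = Arrays.binarySearch(sortedTerms, term);
        // A negative index means the term is absent from this document.
        return idx >= 0 ? freqs[idx] : 0;
    }

    public static void main(String[] args) {
        // Hypothetical term vector for one document's "contents" field.
        String[] terms = {"fox", "john", "smith"};
        int[] freqs = {2, 1, 3};
        System.out.println(frequencyOf(terms, freqs, "smith")); // prints 3
        System.out.println(frequencyOf(terms, freqs, "jones")); // prints 0
    }
}
```

With a real index, the two arrays would come from the document's stored term vector, and the lookup would be repeated only for the handful of hits shown on the page. As noted above, this counts single terms only and says nothing about whether "john" and "smith" occur next to each other.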

