lucene-java-user mailing list archives

From Duke DAI <duke.dai....@gmail.com>
Subject Re: bug of highlighter/SimpleSpanFragmenter, returned longer fragment than expected?
Date Tue, 11 Aug 2015 15:38:40 GMT
Greetings!

Does anybody have any input on this?

Best regards,
Duke
If not now, when? If not me, who?

On Fri, Aug 7, 2015 at 10:58 AM, Duke DAI <duke.dai.007@gmail.com> wrote:

> Hi experts,
>
> I'm trying to reproduce a bug on the Lucene side, and I found something.
>
> In the latest code line, 5.2.1, I slightly modified the test case
> HighlighterTest.testSimpleQueryTermScorerHighlighter as shown below,
> mainly to use SimpleSpanFragmenter and get only one fragment of
> length 64.
>
>   public void testSimpleQueryTermScorerHighlighter() throws Exception {
>     doSearching(new SpanTermQuery(new Term(FIELD_NAME, "cats")));
>     QueryScorer queryScorer = new QueryScorer(query, FIELD_NAME);
>     Highlighter highlighter = new Highlighter(queryScorer);
>     // Highlighter highlighter = new Highlighter(new QueryTermScorer(query));
>     highlighter.setTextFragmenter(new SimpleSpanFragmenter(queryScorer, 64));
>     int maxNumFragmentsRequired = 1;  // only need one fragment
>     for (int i = 0; i < hits.totalHits; i++) {
>       final int docId = hits.scoreDocs[i].doc;
>       final Document doc = searcher.doc(docId);
>       String text = doc.get(FIELD_NAME);
>       TokenStream tokenStream = getAnyTokenStream(FIELD_NAME, docId);
>
>       String result = highlighter.getBestFragments(tokenStream, text,
>           maxNumFragmentsRequired, "...");
>       if (true) System.out.println("\t" + result);
>     }
>     // Not sure we can assert anything here - just running to check we don't
>     // throw any exceptions
>   }
>
> With two documents:
> 1. "The word content does not contain the stem that we are looking for but
> the metadata cats does. Do you think fragmenter work well? Do you think
> fragmenter work well?"
> 2. "The word content does not contain the stem that we are looking for but
> the metadata cats does. "
> I got the corresponding fragments:
> 1. "for but the metadata <B>cats</B> does. Do you think fragmenter work".
> No problem here; it's exactly what I expected.
> 2. "The word content does not contain the stem that we are looking for but
> the metadata <B>cats</B> does. ". The length is clearly more than 64.
> That's the problem reported by my colleague.
>
> More specifically, the problem is caused by the following code snippet in
> SimpleSpanFragmenter.isNewFragment:
>
>     boolean isNewFrag =
>         offsetAtt.endOffset() >= (fragmentSize * currentNumFrags)
>         && (textSize - offsetAtt.endOffset()) >= (fragmentSize >>> 1);
>
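> Near the end of the text, the second condition is false whenever fewer
> than fragmentSize / 2 characters remain, so the fragmenter never starts a
> new fragment there, and the logic that follows does not trim the
> oversized fragment either.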
>
>
> Is it possible to handle this corner case in the standard highlighter code?
>
>
>
> Best regards,
> Duke
> If not now, when? If not me, who?
>
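The boundary check quoted above can be replayed outside Lucene to see why the second document comes back whole. What follows is only a rough sketch under stated assumptions, not the Lucene implementation: the class FragmentBoundaryDemo and its fragments() helper are made up for illustration, tokens are split on whitespace rather than produced by the test analyzer, the span-position logic in SimpleSpanFragmenter is ignored, and the wrapped lines of the two documents are assumed to be joined by single spaces.

```java
import java.util.ArrayList;
import java.util.List;

public class FragmentBoundaryDemo {

    // Hypothetical helper, not Lucene code: replays only the quoted
    // isNewFrag condition over whitespace-delimited tokens.
    static List<String> fragments(String text, int fragmentSize) {
        List<String> out = new ArrayList<>();
        int textSize = text.length();
        int currentNumFrags = 1; // assumed to start at 1
        int fragStart = 0;
        int i = 0;
        while (i < textSize) {
            // Skip whitespace between tokens.
            while (i < textSize && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i >= textSize) {
                break;
            }
            int tokenStart = i;
            while (i < textSize && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int endOffset = i;

            // Same two conditions as in the quoted snippet.
            boolean isNewFrag = endOffset >= (fragmentSize * currentNumFrags)
                && (textSize - endOffset) >= (fragmentSize >>> 1);

            if (isNewFrag) {
                // Assumption: the triggering token opens the next fragment.
                out.add(text.substring(fragStart, tokenStart));
                fragStart = tokenStart;
                currentNumFrags++;
            }
        }
        out.add(text.substring(fragStart));
        return out;
    }

    public static void main(String[] args) {
        String doc2 = "The word content does not contain the stem that we are "
            + "looking for but the metadata cats does. ";
        String doc1 = doc2
            + "Do you think fragmenter work well? Do you think fragmenter work well?";
        for (String doc : new String[] {doc1, doc2}) {
            for (String frag : fragments(doc, 64)) {
                System.out.println(frag.length() + "\t\"" + frag + "\"");
            }
            System.out.println();
        }
    }
}
```

Under these assumptions, document 1 is cut at "for", which matches the fragment reported above. Document 2 stays a single 95-character fragment: by the time a token's end offset reaches 64 (the token "for", ending at 66), only 29 characters remain, which is less than fragmentSize >>> 1 = 32, so the second condition never holds.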
