lucene-java-user mailing list archives

From Duke DAI <duke.dai....@gmail.com>
Subject bug of highlighter/SimpleSpanFragmenter, returned longer fragment than expected?
Date Fri, 07 Aug 2015 02:58:43 GMT
Hi experts,

I'm trying to reproduce a bug on the Lucene side, and I found something.

In the latest release, 5.2.1, I modified the test case
HighlighterTest.testSimpleQueryTermScorerHighlighter slightly, as shown
below, mainly to use SimpleSpanFragmenter to get a single fragment of
length 64.

  public void testSimpleQueryTermScorerHighlighter() throws Exception {
    doSearching(new SpanTermQuery(new Term(FIELD_NAME, "cats")));
    QueryScorer queryScorer = new QueryScorer(query, FIELD_NAME);
    Highlighter highlighter = new Highlighter(queryScorer);
    // Highlighter highlighter = new Highlighter(new QueryTermScorer(query));
    highlighter.setTextFragmenter(new SimpleSpanFragmenter(queryScorer, 64));
    int maxNumFragmentsRequired = 1;  // only need one fragment
    for (int i = 0; i < hits.totalHits; i++) {
      final int docId = hits.scoreDocs[i].doc;
      final Document doc = searcher.doc(docId);
      String text = doc.get(FIELD_NAME);
      TokenStream tokenStream = getAnyTokenStream(FIELD_NAME, docId);

      String result = highlighter.getBestFragments(tokenStream, text,
          maxNumFragmentsRequired, "...");
      if (true) System.out.println("\t" + result);
    }
    // Not sure we can assert anything here - just running to check we don't
    // throw any exceptions
  }

With these two documents:
1. "The word content does not contain the stem that we are looking for but
the metadata cats does. Do you think fragmenter work well? Do you think
fragmenter work well?"
2. "The word content does not contain the stem that we are looking for but
the metadata cats does. "
I got the corresponding fragments:
1. "for but the metadata <B>cats</B> does. Do you think fragmenter work".
No problem here; it's exactly what I expected.
2. "The word content does not contain the stem that we are looking for but
the metadata <B>cats</B> does. ". The length is clearly more than 64.
That's the problem reported by my colleague.

More specifically, the problem is caused by the following code snippet in
SimpleSpanFragmenter.isNewFragment:

    boolean isNewFrag = offsetAtt.endOffset() >= (fragmentSize * currentNumFrags)
        && (textSize - offsetAtt.endOffset()) >= (fragmentSize >>> 1);

Near the end of the text, the fragmenter never starts a new fragment,
because the remaining text is shorter than half of fragmentSize, and the
logic that follows does not trim the fragment either.
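
To make the arithmetic concrete, here is a minimal standalone sketch that
replays only the boundary condition quoted above against the two documents.
It is not the Lucene implementation: the class name FragmentBoundarySketch,
the method fragmentStarts, and the whitespace tokenizer are assumptions for
illustration, and the span-position handling in isNewFragment is left out.
Because tokens here keep trailing punctuation, offsets can differ by one
from what a real analyzer would report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: replays the isNewFrag condition only.
final class FragmentBoundarySketch {
  // Document 2 from the message, including its trailing space (95 chars).
  static final String DOC2 =
      "The word content does not contain the stem that we are looking for but "
          + "the metadata cats does. ";
  // Document 1 is document 2 plus two more sentences (164 chars).
  static final String DOC1 =
      DOC2 + "Do you think fragmenter work well? Do you think fragmenter work well?";

  // Returns the start offsets of tokens at which the condition would open a new fragment.
  static List<Integer> fragmentStarts(String text, int fragmentSize) {
    List<Integer> starts = new ArrayList<>();
    int textSize = text.length();
    int currentNumFrags = 1;
    // Assumption: whitespace-delimited tokens stand in for the analyzer's tokens.
    Matcher m = Pattern.compile("\\S+").matcher(text);
    while (m.find()) {
      int endOffset = m.end();
      // Same shape as the condition quoted from SimpleSpanFragmenter.isNewFragment.
      boolean isNewFrag = endOffset >= fragmentSize * currentNumFrags
          && (textSize - endOffset) >= (fragmentSize >>> 1);
      if (isNewFrag) {
        currentNumFrags++;
        starts.add(m.start());
      }
    }
    return starts;
  }

  public static void main(String[] args) {
    System.out.println(fragmentStarts(DOC1, 64)); // prints [63, 124]
    System.out.println(fragmentStarts(DOC2, 64)); // prints []
  }
}
```

For document 1, the first break lands on "for" (offset 63), which matches
the fragment shown above. For document 2, "for" ends at offset 66 and only
95 - 66 = 29 characters remain, which is less than 64 >>> 1 = 32, so the
second clause is never satisfied and the whole text stays in one fragment.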


Is it possible to handle this corner case in standard highlighter code?



Best regards,
Duke
If not now, when? If not me, who?
