lucene-java-user mailing list archives

From Robert Alexander <roba...@gmail.com>
Subject Re: bug of highlighter/SimpleSpanFragmenter, returned longer fragment than expected?
Date Wed, 12 Aug 2015 15:23:49 GMT
I've been digging into a similar issue and eventually found this Jira ticket.

https://issues.apache.org/jira/browse/LUCENE-2229

So far I haven't received any response in IRC or from the mailing list, and
the bug is resolved as "won't fix" even though there's a patch attached
that attempts to solve the issue.

For now I have given up. I'm assuming that most of the Lucene community
just doesn't use that highlighter anymore. It is also difficult to
reproduce the issue, so it probably doesn't cause a problem all that often.
It isn't worth my time right now to dig much deeper.

On Tue, Aug 11, 2015 at 10:38 AM, Duke DAI <duke.dai.007@gmail.com> wrote:

> Greetings!
>
> Does anybody have input on this?
>
> Best regards,
> Duke
> If not now, when? If not me, who?
>
> On Fri, Aug 7, 2015 at 10:58 AM, Duke DAI <duke.dai.007@gmail.com> wrote:
>
> > Hi experts,
> >
> > I'm trying to reproduce a bug on the Lucene side, and found something.
> >
> > In the latest code line, 5.2.1, I modified the test case
> > HighlighterTest.testSimpleQueryTermScorerHighlighter slightly, as shown
> > below, mainly to use SimpleSpanFragmenter to get only one fragment of
> > length 64.
> >
> >   public void testSimpleQueryTermScorerHighlighter() throws Exception {
> >     doSearching(new SpanTermQuery(new Term(FIELD_NAME, "cats")));
> >     QueryScorer queryScorer = new QueryScorer(query, FIELD_NAME);
> >     Highlighter highlighter = new Highlighter(queryScorer);
> >     // Highlighter highlighter = new Highlighter(new QueryTermScorer(query));
> >     highlighter.setTextFragmenter(new SimpleSpanFragmenter(queryScorer, 64));
> >     int maxNumFragmentsRequired = 1;  // only need one fragment
> >     for (int i = 0; i < hits.totalHits; i++) {
> >       final int docId = hits.scoreDocs[i].doc;
> >       final Document doc = searcher.doc(docId);
> >       String text = doc.get(FIELD_NAME);
> >       TokenStream tokenStream = getAnyTokenStream(FIELD_NAME, docId);
> >
> >       String result = highlighter.getBestFragments(tokenStream, text,
> >           maxNumFragmentsRequired, "...");
> >       if (true) System.out.println("\t" + result);
> >     }
> >     // Not sure we can assert anything here - just running to check we don't
> >     // throw any exceptions
> >   }
> >
> > With two documents:
> > 1. "The word content does not contain the stem that we are looking for but
> > the metadata cats does. Do you think fragmenter work well? Do you think
> > fragmenter work well?"
> > 2. "The word content does not contain the stem that we are looking for but
> > the metadata cats does. "
> > I got the corresponding fragments:
> > 1. "for but the metadata <B>cats</B> does. Do you think fragmenter work".
> > No problem, it's exactly what I expected.
> > 2. "The word content does not contain the stem that we are looking for but
> > the metadata <B>cats</B> does. ". The length is clearly more than 64.
> > That's the problem reported by my colleague.
> >
> > More specifically, the problem is caused by the following code snippet in
> > SimpleSpanFragmenter.isNewFragment:
> >
> >     boolean isNewFrag = offsetAtt.endOffset() >= (fragmentSize * currentNumFrags)
> >         && (textSize - offsetAtt.endOffset()) >= (fragmentSize >>> 1);
> >
> > Near the end of the text, fewer than fragmentSize / 2 characters remain,
> > so the fragmenter never starts a new fragment, and the logic that follows
> > does not trim the result either.
> >
> >
> > Is it possible to handle this corner case in standard highlighter code?
> >
> >
> >
> > Best regards,
> > Duke
> > If not now, when? If not me, who?
> >
>
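
The condition quoted above can be traced outside Lucene with a minimal sketch. It assumes plain whitespace tokenization rather than the analyzer used by HighlighterTest, and it ignores the span-position handling in SimpleSpanFragmenter. The class and method names are hypothetical and not part of Lucene; only the boolean expression is taken from the quoted snippet.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone model of the quoted isNewFragment condition.
class FragmentBoundarySketch {

    // Returns the start offset of every fragment the condition would open.
    static List<Integer> fragmentStarts(String text, int fragmentSize) {
        List<Integer> starts = new ArrayList<>();
        starts.add(0);                       // the first fragment starts at offset 0
        int currentNumFrags = 1;
        int textSize = text.length();
        int pos = 0;
        while (pos < textSize) {
            // Skip whitespace between tokens (assumption: whitespace tokenizer).
            while (pos < textSize && Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            if (pos >= textSize) {
                break;
            }
            int startOffset = pos;
            while (pos < textSize && !Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            int endOffset = pos;

            // Same expression as in the quoted snippet.
            boolean isNewFrag = endOffset >= (fragmentSize * currentNumFrags)
                && (textSize - endOffset) >= (fragmentSize >>> 1);
            if (isNewFrag) {
                starts.add(startOffset);
                currentNumFrags++;
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String base = "The word content does not contain the stem that we are "
            + "looking for but the metadata cats does. ";
        String doc1 = base + "Do you think fragmenter work well? "
            + "Do you think fragmenter work well?";
        String doc2 = base;
        System.out.println(fragmentStarts(doc1, 64));
        System.out.println(fragmentStarts(doc2, 64));
    }
}
```

For document 2 (95 characters including the trailing space), the first token that ends at or past offset 64 is "for", with end offset 66. Only 29 characters remain after it, which is less than fragmentSize >>> 1 = 32, so the second clause is false for that token and for every later one, and the whole text stays in a single fragment. For document 1 the same token leaves 98 characters, so a new fragment starts at "for", which matches the fragment reported above.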
