lucene-dev mailing list archives

From "yuanyun.cn (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-5381) Lucene highlighter doesn't honor hl.fragsize; it appends all text for last fragment
Date Wed, 01 Jan 2014 15:22:50 GMT
yuanyun.cn created LUCENE-5381:
----------------------------------

             Summary: Lucene highlighter doesn't honor hl.fragsize; it appends all text for
last fragment
                 Key: LUCENE-5381
                 URL: https://issues.apache.org/jira/browse/LUCENE-5381
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/highlighter
    Affects Versions: 4.6, 4.0
            Reporter: yuanyun.cn
            Priority: Minor
             Fix For: 5.0, 4.7
         Attachments: LUCENE-5381.patch

Recently we hit a problem with the highlighter: I set hl.fragsize = 300, but the highlight
section for one document outputs more than 2000 characters.

Looking into the code: in org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(TokenStream,
String, boolean, int), after the for loop, it appends all of the remaining text to the last fragment:
if (
		// if there is text beyond the last token considered..
		(lastEndOffset < text.length())
		&&
		// and that text is not too large...
		(text.length()<= maxDocCharsToAnalyze)
	)
{
	//append it to the last fragment
	newText.append(encoder.encodeText(text.substring(lastEndOffset)));
}
currentFrag.textEndPos = newText.length();

This code is problematic because in some cases the last fragment is the most relevant section
and is the one selected to return to the client, so the returned highlight can far exceed the
configured fragment size.

I changed the code as shown below, and it seems to work for me :)
//Test what remains of the original text beyond the point where we stopped analyzing
if(lastEndOffset < text.length())
{
	if(textFragmenter instanceof SimpleFragmenter)
	{
		SimpleFragmenter simpleFragmenter = (SimpleFragmenter) textFragmenter;
		int remain = simpleFragmenter.getFragmentSize() - (newText.length() - currentFrag.textStartPos);
		if(remain > 0 )
		{
			int endIndex = lastEndOffset + remain;
			if (endIndex > text.length()) {
				endIndex = text.length();
			}
			newText.append(encoder.encodeText(text.substring(lastEndOffset,
					endIndex)));
		}
	}
	else
	{
		newText.append(encoder.encodeText(text.substring(lastEndOffset)));
	}
}
currentFrag.textEndPos = newText.length();
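
To make the clamping arithmetic in the change above easier to check in isolation, here is a minimal stand-alone sketch. It does not use Lucene at all: the class TailClampSketch and the method boundedTail are hypothetical names introduced only for illustration, and currentFragLen stands in for (newText.length() - currentFrag.textStartPos). Encoding via the encoder is left out, so this only shows how much of the unanalyzed tail would be appended.

```java
// Hypothetical stand-alone helper for illustration; not part of the Lucene API.
final class TailClampSketch {

    /**
     * Returns the part of the unanalyzed tail that still fits in the last fragment.
     *
     * @param text           the full original text
     * @param lastEndOffset  end offset of the last token that was analyzed
     * @param fragmentSize   target fragment size (assumed to mirror hl.fragsize)
     * @param currentFragLen number of characters already in the last fragment
     */
    static String boundedTail(String text, int lastEndOffset, int fragmentSize, int currentFragLen) {
        if (lastEndOffset >= text.length()) {
            return ""; // no text beyond the last token considered
        }
        int remain = fragmentSize - currentFragLen;
        if (remain <= 0) {
            return ""; // the last fragment is already full
        }
        // Take at most 'remain' characters, and never read past the end of the text.
        int take = Math.min(text.length() - lastEndOffset, remain);
        return text.substring(lastEndOffset, lastEndOffset + take);
    }

    public static void main(String[] args) {
        String text = "0123456789abcdefghij"; // 20 characters
        // Fragment size 8 with 3 characters already used leaves room for 5 more.
        System.out.println(boundedTail(text, 10, 8, 3));
    }
}
```

With the original behavior, the same call would have appended everything from offset 10 to the end of the text regardless of the fragment size.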


