lucene-dev mailing list archives

From "Matthias Pigulla (JIRA)" <>
Subject [jira] [Commented] (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory
Date Mon, 02 May 2011 12:01:04 GMT


Matthias Pigulla commented on SOLR-42:

I don't think it's a duplicate, and the issue is still unresolved, at least with regard to [#comment-12625835]
and the 1.4.1 release.

The input string "<??>xx yy xx" yields start offsets of 3, 6 and 9 for xx, yy and xx
respectively. Each is off by one: "<??>" is four characters long, so the correct offsets are 4, 7 and 10.

"<??><??>xx yy xx" even yields 6, 9 and 12 instead of 8, 11 and 14; that is, every "<??>" (as
a special "degenerate" kind of XML PI) shifts the offsets by one.
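
To make the arithmetic above easy to check, here is a minimal Python sketch (not Solr code) that computes where the tokens start in the original input and compares that with the offsets reported above. The helper name expected_start_offsets and the simplified rule "anything between '<' and '>' is markup" are assumptions made for illustration only.

```python
import re

# Assumption: anything between '<' and '>' counts as markup. This is far
# simpler than what a real HTML/XML stripper has to handle.
TAG = re.compile(r"<[^>]*>")

def expected_start_offsets(text):
    """Start offsets, in the ORIGINAL text, of whitespace-separated tokens."""
    # Blank out markup with spaces so every remaining character keeps its index.
    masked = TAG.sub(lambda m: " " * len(m.group()), text)
    return [m.start() for m in re.finditer(r"\S+", masked)]

# Start offsets as reported in this comment for the 1.4.1 release.
reported = {
    "<??>xx yy xx": [3, 6, 9],
    "<??><??>xx yy xx": [6, 9, 12],
}

for text, got in reported.items():
    want = expected_start_offsets(text)
    # Per-token difference between the expected and the reported offset.
    print(text, want, [w - g for w, g in zip(want, got)])
```

The per-token difference is 1 for the first input and 2 for the second, which matches the observation that each "<??>" costs one character.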

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>                 Key: SOLR-42
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments:, HtmlStripReaderTestXmlProcessing.patch,
HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch,, htmlStripReaderTest.html
> Indexing content that contains HTML markup causes problems with highlighting if the
> HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em>
> tags in the wrong place: 22 characters to the left of where they should be (i.e. the sum
> of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 
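
The two-phase behaviour described in the quoted issue, and the kind of fix it suggests, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the HTMLStripReader implementation: the names strip_with_offsets and tokenize are hypothetical, and markup is again simplified to "anything between '<' and '>'". The idea is to record, for every character that survives stripping, its index in the original input, so that token offsets can be mapped back before they reach the highlighter.

```python
import re

TAG = re.compile(r"<[^>]*>")  # assumption: simplified notion of a tag

def strip_with_offsets(text):
    """Remove markup; return (stripped, offset_map), where offset_map[i] is
    the index in `text` of the i-th character of `stripped`."""
    chars, offset_map, pos = [], [], 0
    for m in TAG.finditer(text):
        for i in range(pos, m.start()):  # keep the text before this tag
            chars.append(text[i])
            offset_map.append(i)
        pos = m.end()                    # skip the tag itself
    for i in range(pos, len(text)):      # keep the text after the last tag
        chars.append(text[i])
        offset_map.append(i)
    return "".join(chars), offset_map

def tokenize(text):
    """Whitespace-tokenize the stripped text; return tuples of
    (token, stripped_start, original_start, original_end)."""
    stripped, offset_map = strip_with_offsets(text)
    tokens = []
    for m in re.finditer(r"\S+", stripped):
        start = offset_map[m.start()]
        end = offset_map[m.end() - 1] + 1  # exclusive end in the original
        tokens.append((m.group(), m.start(), start, end))
    return tokens

TITLE = ("<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic "
         "fabrics in a polyorogenic terrane of NW Iberia")

for token, stripped_start, start, end in tokenize(TITLE):
    if token == "fabrics":
        # Uncorrected offset vs. corrected offset, and their difference.
        print(stripped_start, start, start - stripped_start)
        print(TITLE[start:end])
```

For the title from the issue, the uncorrected and corrected start offsets of "fabrics" differ by 22, the combined length of the four tags. A token that spans markup, such as the first one in the title, gets a corrected range that includes the tags inside it; whether that is desirable for highlighting is a separate design choice.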

This message is automatically generated by JIRA.