lucene-solr-user mailing list archives

From "Yonik Seeley" <yo...@apache.org>
Subject Re: Highlighting problems with HTML tagged fields
Date Fri, 28 Jul 2006 20:23:15 GMT
On 7/28/06, Andrew May <amay@ingenta.com> wrote:
> Because I don't want the tags indexed I'm using a modified version of the "text" field
> type that uses the HTMLStripWhitespaceTokenizerFactory instead of the normal
> WhitespaceTokenizerFactory.

HTMLStripWhitespaceTokenizerFactory works in two phases...
HTMLStripReader removes the HTML and passes the result to
WhitespaceTokenizer... at that point, Tokens are generated, but the
offsets will correspond to the text after HTML removal, not before.
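
To make the mismatch concrete, here is a minimal Python sketch of the two phases. It is not the Solr code: strip_html and whitespace_tokens are hypothetical stand-ins for HTMLStripReader and WhitespaceTokenizer, and the regex only handles simple tags.

```python
import re

def strip_html(text):
    # Hypothetical stand-in for HTMLStripReader: drop anything that looks like a tag.
    return re.sub(r"<[^>]*>", "", text)

def whitespace_tokens(text):
    # Hypothetical stand-in for WhitespaceTokenizer:
    # one (term, start, end) tuple per run of non-whitespace characters.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

original = "<p>Hello <b>world</b></p>"
stripped = strip_html(original)        # phase 1: remove the HTML
tokens = whitespace_tokens(stripped)   # phase 2: tokenize what is left

for term, start, end in tokens:
    # The offsets are valid in the stripped text but not in the original.
    print(term, (start, end), repr(stripped[start:end]), repr(original[start:end]))
```

Applying the offsets to the original (stored) text picks out the wrong characters, which is presumably what the highlighter ends up doing.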

I did it this way so that HTMLStripReader could go before any
tokenizer (like StandardTokenizer).

Can you open a JIRA bug for this?  The fix would be a special version
of HTMLStripReader integrated with a WhitespaceTokenizer to keep
offsets correct.
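
As a rough illustration of that idea (a Python sketch under simplifying assumptions, not the eventual implementation): a single pass that skips tags while tokenizing on whitespace, so each token keeps its offsets in the original text. The function name html_aware_whitespace_tokens is made up for this example, and it ignores entities, comments, scripts, and a bare "<" in the text.

```python
def html_aware_whitespace_tokens(text):
    # Hypothetical combined stripper/tokenizer: returns (term, start, end)
    # tuples whose offsets index into the ORIGINAL text, tags included.
    tokens = []
    term = []        # characters of the token being built
    start = None     # original offset of the token's first character
    end = 0          # original offset just past the token's last character
    in_tag = False
    for i, ch in enumerate(text):
        if in_tag:
            # Inside <...>: skip everything up to and including '>'.
            if ch == ">":
                in_tag = False
            continue
        if ch == "<":
            in_tag = True
            continue
        if ch.isspace():
            # Whitespace ends the current token, if there is one.
            if term:
                tokens.append(("".join(term), start, end))
                term = []
            continue
        if not term:
            start = i
        term.append(ch)
        end = i + 1
    if term:
        tokens.append(("".join(term), start, end))
    return tokens

original = "<p>Hello <b>world</b></p>"
tokens = html_aware_whitespace_tokens(original)
for term, start, end in tokens:
    print(term, (start, end), repr(original[start:end]))
```

Because tags are dropped rather than replaced with a space, a tag in the middle of a word yields one token whose offsets span the tag, which matches what stripping first and tokenizing second would index.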

-Yonik
