lucene-solr-user mailing list archives

From Andrew May <a...@ingenta.com>
Subject Highlighting problems with HTML tagged fields
Date Fri, 28 Jul 2006 19:48:50 GMT
Hi,

I'm indexing some content that contains HTML markup, and this seems to throw off the 
highlighting somehow.

Example title field:

<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia

If I search for title:fabrics and turn highlighting on, the highlighted version has the 
<em> tags in the wrong place: 22 characters to the left of where they should be (i.e. the 
sum of the lengths of the tags).
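
The 22-character shift can be reproduced outside Solr. The sketch below is only an illustration of the symptom, assuming the highlighter is handed offsets computed against the tag-stripped text and applies them to the stored original. It is not Solr's code, and the regex and the `highlight` helper are hypothetical names introduced for the example.

```python
import re

# Hypothetical, simplified tag matcher; real HTML stripping is more involved.
TAG = re.compile(r"<[^>]+>")

title = ("<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics "
         "in a polyorogenic terrane of NW Iberia")
term = "fabrics"

# Text as the tokenizer would see it after the markup is removed.
stripped = TAG.sub("", title)

# Offset of the term in the stripped text vs. in the stored original.
start_in_stripped = stripped.index(term)
start_in_original = title.index(term)

# The difference equals the total length of the tags that precede the term.
shift = start_in_original - start_in_stripped
tag_total = sum(len(m.group(0)) for m in TAG.finditer(title))


def highlight(text, start, length):
    """Wrap text[start:start+length] in <em> tags (hypothetical helper)."""
    end = start + length
    return text[:start] + "<em>" + text[start:end] + "</em>" + text[end:]


# Applying stripped-text offsets to the original puts <em> too far left.
wrong = highlight(title, start_in_stripped, len(term))
# Applying offsets that account for the removed tags gives the intended result.
right = highlight(title, start_in_original, len(term))

print(shift, tag_total)  # 22 22
print(wrong)
print(right)
```

Here the two `<SUP>` and two `</SUP>` tags contribute 5 + 6 + 5 + 6 = 22 characters, which matches the observed displacement.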

Because I don't want the tags indexed, I'm using a modified version of the "text" field 
type that uses the HTMLStripWhitespaceTokenizerFactory instead of the normal 
WhitespaceTokenizerFactory. I've tried using this tokenizer only when indexing, and both 
when indexing and querying, but the highlighting is wrong in the same way in both cases.

There's no problem if I use the normal WhitespaceTokenizerFactory, but then it's possible 
to search the tags and find matches, which isn't ideal.

The closest thing I can find on the Lucene mailing list is the thread below, which seems 
to suggest that this ought to work:

http://www.gossamer-threads.com/lists/lucene/java-user/14981?search_string=HTML%20strip;#14981

Thanks,

Andrew
