lucene-solr-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com>
Subject Re: What is correct use of HTMLStripCharFilter in Solr 3.1
Date Thu, 12 May 2011 18:55:40 GMT
> I recently upgraded from Solr 1.3 to Solr 3.1 in order to take advantage
> of the HTMLStripCharFilter. But it isn't working as I expected.
>
> I have a text field that may contain HTML tags. However, I would like to
> store it in Solr without the HTML tags, and retrieve the text field for
> display and for highlighting without HTML tags.
>
> I added <charFilter class="solr.HTMLStripCharFilterFactory"/> to the top
> of <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="true"> in the
> schema.xml file of the Solr example, both in <analyzer type="index"> and
> in <analyzer type="query">.
>
> And the text field is simply:
>
> <field name="text" type="text" indexed="true" stored="true"/>
>
> Now, when I do a search, the text field still has all the HTML tags in
> it, and the highlighting is totally screwed up, with em tags around
> virtually every word. What am I doing wrong?

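A charFilter only changes the tokens that get indexed. The stored value,
which is what search results and highlighting return, is always the
original input. So you need to strip the HTML tags before the analysis
phase, that is, before the field value reaches the analyzer. If you are
using DIH, you can use the HTMLStripTransformer and set stripHTML="true"
on the field.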
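
For the DIH case, a minimal sketch of what the data-config.xml entity
might look like. The entity name, the query, and the column name are
placeholders for illustration, not taken from your setup; only
transformer="HTMLStripTransformer" and stripHTML="true" are the relevant
parts:

```xml
<!-- Hypothetical entity: name, query and column are assumptions. -->
<entity name="doc"
        query="select id, text from docs"
        transformer="HTMLStripTransformer">
  <!-- stripHTML="true" removes the tags before the value is handed
       to Solr, so the stored value is already plain text. -->
  <field column="text" stripHTML="true"/>
</entity>
```

With that in place, the charFilter in the fieldType should no longer be
needed for this field, since the tags are gone before analysis starts.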
