lucene-solr-user mailing list archives

From Ashok <ash...@qualcomm.com>
Subject HTML entities being missed by DIH HTMLStripTransformer
Date Wed, 03 Apr 2013 19:00:30 GMT
Hi,

I am using DIH to index some database fields. These fields contain
HTML-formatted text. I use the 'HTMLStripTransformer' to remove that
markup. This works fine when the text contains literal tags, for example:

<li>Item One</li> or <b>This is in Bold</b>

However, when the markup is escaped with HTML entity names, as in:

&lt;li&gt;Item One&lt;/li&gt; or &lt;b&gt;This is in Bold&lt;/b&gt;

NOTHING HAPPENS. 

Two questions.

(1) Is this the expected behavior of DIH HTMLStripTransformer?
(2) If yes, is there another transformer that I can employ first to turn
these html entities into their usual symbols that can then be removed by the
DIH HTMLStripTransformer?
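
To make question (2) concrete, here is a rough sketch, outside Solr, of the
two-step behaviour I am after: decode the entities first, then strip the
resulting tags. This is plain Python for illustration only. The helper names
are hypothetical, the tag regex is deliberately naive, and none of it is meant
to reflect how HTMLStripTransformer works internally.

```python
import html
import re

# Naive tag matcher for illustration; not a real HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Remove anything that looks like an HTML tag."""
    return TAG_RE.sub("", text)


def unescape_then_strip(text):
    """Decode entities so escaped tags become real tags, then strip them."""
    return strip_tags(html.unescape(text))


literal = "<li>Item One</li> or <b>This is in Bold</b>"
escaped = "&lt;li&gt;Item One&lt;/li&gt; or &lt;b&gt;This is in Bold&lt;/b&gt;"

print(strip_tags(literal))           # tags removed
print(strip_tags(escaped))           # unchanged: there are no literal tags to match
print(unescape_then_strip(escaped))  # tags removed after decoding
```

In DIH terms, the question is whether some transformer can play the role of
the decoding step before HTMLStripTransformer runs.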

Thanks

- ashok



--
View this message in context: http://lucene.472066.n3.nabble.com/HTML-entities-being-missed-by-DIH-HTMLStripTransformer-tp4053582.html
Sent from the Solr - User mailing list archive at Nabble.com.
