lucene-solr-user mailing list archives

Subject Re: escaping HTML tags within XML file
Date Mon, 26 Sep 2011 01:54:37 GMT
Yes sir!

Sent from my iPhone

On Sep 25, 2011, at 4:06 PM, okayndc <> wrote:

> Here is a representation of the XML file...
> <root>
> <commenter>
> <comment><p>Text here</p><img src="image.gif" /><p>More
> here....</p></comment>
> </commenter>
> </root>
> I want to keep the HTML tags because they keep the formatting (paragraph
> tags, etc.) intact for the output.  It seems like you're saying that the HTML
> can be kept intact by using an HTML field type, without having to
> escape the HTML tags?
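
A minimal sketch of what an XML parser makes of the sample above when the HTML is left unescaped. It uses Python's standard `xml.etree.ElementTree`; the `comment_markup` helper and the `SAMPLE` constant are illustrative names, not anything from Solr. The sample happens to be well-formed XML because the `<img ... />` tag is self-closed, so the parser treats the HTML tags as child elements of `<comment>` rather than as text. HTML that is not well-formed XML (an unclosed `<br>`, or an entity such as `&nbsp;`) would make the parse fail instead.

```python
import xml.etree.ElementTree as ET

# The sample from the message above, with the HTML left unescaped.
SAMPLE = """<root>
<commenter>
<comment><p>Text here</p><img src="image.gif" /><p>More
here....</p></comment>
</commenter>
</root>"""


def comment_markup(xml_text):
    """Return (text, child_tags, markup) for the first <comment> element.

    text       -- character data directly inside <comment> (None if absent)
    child_tags -- names of the elements the parser found inside <comment>
    markup     -- the children serialized back into a single string
    """
    root = ET.fromstring(xml_text)
    comment = root.find("./commenter/comment")
    child_tags = [child.tag for child in comment]
    # Re-serialize each child to recover the markup as one string.
    parts = [comment.text or ""]
    parts.extend(ET.tostring(child, encoding="unicode") for child in comment)
    return comment.text, child_tags, "".join(parts)


if __name__ == "__main__":
    text, tags, markup = comment_markup(SAMPLE)
    print(text)    # no direct text: the content is child elements
    print(tags)
    print(markup)
```

Whether the indexing side accepts nested elements inside a field value is a separate question; this only shows that unescaped HTML stops being a plain string once the file is parsed as XML.
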
> On Sun, Sep 25, 2011 at 2:52 PM, <> wrote:
>> Assuming that the XML has the HTML as values inside fully formed tags like
>> so:
>> <node><HTML></HTML></node> then I think that using the "HTML" field type in
>> schema.xml for indexing/storing will allow you to do meaningful searches on
>> the content of the HTML without getting confused by the HTML syntax itself.
>> If you have absolutely no need for the entire stored HTML when presenting
>> results to the user then stripping out the syntax at index time makes sense.
>> This will adversely affect highlighting of that document field as well, so
>> just know your requirements.
>> If you don't want to present anything at all then don't store, just index
>> and use the right field type (HTML) such that search results find the right
>> document. Just because a field is helpful in finding the doc, doesn't mean
>> folks always want to present it or store it.
>> With Data Import Handler, an HTML-stripping transformer is available so that
>> the markup is removed before the indexer gets its hands on things. I can't be
>> sure if that is how you get your data into Solr.
>> - Pulkit
>> Sent from my iPhone
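
To make "stripping out the syntax" concrete, here is a rough sketch of tag stripping using Python's standard `html.parser`. It is not Solr's implementation and not the Data Import Handler transformer; the `TextOnly` class and `strip_tags` function are hypothetical names used only to show the kind of text that remains for searching once the markup is gone.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data and ignore every tag and attribute."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment, whitespace-normalized."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    fragment = '<p>Text here</p><img src="image.gif" /><p>More here....</p>'
    print(strip_tags(fragment))
```

Note that the `<img>` tag and its `src` attribute disappear entirely, which is the trade-off described above: the stripped text is easier to search, but the original formatting cannot be rebuilt from it.
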
>> On Sep 25, 2011, at 8:00 AM, okayndc <> wrote:
>>> Hello,
>>> Was wondering if it is necessary to escape HTML tags within an XML file for
>>> indexing?  If so, seems like large XML files with tons of HTML tags could
>>> get really messy (using CDATA).
>>> Has this been your experience?  Do you escape the HTML tags? If so, what
>>> technique do you use? Or do you leave the HTML tags in place without
>>> escaping them?
>>> Thanks!
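
For reference, a small sketch of the two escaping options raised in the original question: entity escaping and CDATA. It uses Python's standard library. The `<comment>` element name comes from the sample later in the thread, and `as_escaped` and `as_cdata` are illustrative helper names. Either way, an XML parser hands back the original HTML as a single text value.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

HTML = '<p>Text here</p><img src="image.gif" /><p>More here....</p>'


def as_escaped(html):
    """Wrap html in <comment>, escaping &, < and > as XML entities."""
    return "<comment>" + escape(html) + "</comment>"


def as_cdata(html):
    """Wrap html in <comment> using a CDATA section.

    The sequence "]]>" cannot appear inside CDATA, so it is split
    across two adjacent CDATA sections.
    """
    safe = html.replace("]]>", "]]]]><![CDATA[>")
    return "<comment><![CDATA[" + safe + "]]></comment>"


if __name__ == "__main__":
    for xml_text in (as_escaped(HTML), as_cdata(HTML)):
        print(xml_text)
        # Both forms parse back to the same string.
        print(ET.fromstring(xml_text).text == HTML)
```

If the XML is generated by a library rather than by hand (for example, by setting an element's `text` in `ElementTree` and serializing it), the entity escaping is normally done automatically, which avoids the CDATA clutter the question worries about.
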
