lucene-solr-user mailing list archives

From pulkitsing...@gmail.com
Subject Re: escaping HTML tags within XML file
Date Mon, 26 Sep 2011 01:54:37 GMT
Yes sir!

Sent from my iPhone

On Sep 25, 2011, at 4:06 PM, okayndc <bodymoves@gmail.com> wrote:

> Here is a representation of the XML file...
> 
> <root>
> <commenter>
> <comment><p>Text here</p><img src="image.gif" /><p>More text
> here....</p></comment>
> </commenter>
> </root>
> 
> I want to keep the HTML tags because they keep the formatting (paragraph
> tags, etc.) intact for the output.  Seems like you're saying that the HTML
> can be kept intact with the use of an HTML field type, without having to
> escape the HTML tags?
> 
> On Sun, Sep 25, 2011 at 2:52 PM, <pulkitsinghal@gmail.com> wrote:
> 
>> Assuming that the XML has the HTML as values inside fully formed tags like
>> so:
>> <node><HTML></HTML></node> then I think that using the "HTML" field type in
>> schema.xml for indexing/storing will allow you to do meaningful searches on
>> the content of the HTML without getting confused by the HTML syntax itself.
>> 
>> If you have absolutely no need for the entire stored HTML when presenting
>> results to the user then stripping out the syntax at index time makes sense.
>> This will adversely affect highlighting of that document field as well, so
>> just know your requirements.
>> 
>> If you don't want to present anything at all then don't store, just index
>> and use the right field type (HTML) such that search results find the right
>> document. Just because a field is helpful in finding the doc, doesn't mean
>> folks always want to present it or store it.
>> 
>> With Data Import Handler, an HTML-stripping transformer is available so that
>> the markup is removed before the indexer gets its hands on things. I can't be
>> sure if that is how you get your data into Solr.
>> 
>> - Pulkit
>> 
>> Sent from my iPhone
>> 
>> On Sep 25, 2011, at 8:00 AM, okayndc <bodymoves@gmail.com> wrote:
>> 
>>> Hello,
>>> 
>>> Was wondering if it is necessary to escape HTML tags within an XML file for
>>> indexing?  If so, seems like large XML files with tons of HTML tags could
>>> get really messy (using CDATA).
>>> Has this been your experience?  Do you escape the HTML tags? If so, what
>>> technique do you use? Or do you leave the HTML tags in place without
>>> escaping them?
>>> 
>>> Thanks!
>> 
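
For reference, the two options raised in the original question (entity-escaping
the HTML versus wrapping it in CDATA) can be sketched in a few lines of Python.
This is only an illustration using the standard library, not anything
Solr-specific: the helper names build_doc and wrap_cdata are made up for this
example, and the element names are taken from the sample file quoted above.
Either way, an XML parser hands back the original HTML string unchanged.

```python
import xml.etree.ElementTree as ET

# HTML fragment from the sample file in this thread.
html = '<p>Text here</p><img src="image.gif" /><p>More text here....</p>'


def build_doc(comment_html):
    """Build the sample document; ElementTree entity-escapes the text on output."""
    root = ET.Element("root")
    commenter = ET.SubElement(root, "commenter")
    comment = ET.SubElement(commenter, "comment")
    comment.text = comment_html  # stored as plain text, escaped when serialized
    return ET.tostring(root, encoding="unicode")


def wrap_cdata(comment_html):
    """Wrap text in a CDATA section (hypothetical helper for hand-built XML).

    A CDATA section cannot contain the sequence "]]>", so it is split
    across two adjacent sections if it occurs.
    """
    safe = comment_html.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"


# Option 1: entity escaping (&lt;p&gt;...), handled by the serializer.
xml_text = build_doc(html)
round_trip = ET.fromstring(xml_text).find("commenter/comment").text

# Option 2: CDATA, which keeps the markup readable in the file itself.
cdata_xml = "<comment>" + wrap_cdata(html) + "</comment>"
cdata_round_trip = ET.fromstring(cdata_xml).text

print(xml_text)
print(cdata_xml)
```

Letting a library do the serialization avoids hand-escaping entirely, which is
usually less messy than maintaining CDATA blocks by hand in large files.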
