cocoon-users mailing list archives

From Andrew Stevens <>
Subject RE: Parsing HTML entities
Date Fri, 31 Aug 2007 15:58:18 GMT

Oh, for crying out loud.  Even after switching to plain text Hotmail still 
strips out my included XML :-(
Let's try again - replace the square brackets below with the appropriate 
less-than and greater-than symbols.

> From:
> Date: Fri, 31 Aug 2007 14:06:59 +0000
> Tobia Conforto < tobia.conforto < at>> writes:
>> I have a data source from which I get SAX text nodes into my pipeline
>> that contain escaped HTML entities and <br> tags. In Java syntax:
>> "Lorem ipsum &mdash; dolor sit amet. <br> Consectetuer"
>> or, in XML syntax:
>> Lorem ipsum &amp;mdash; dolor sit amet. &lt;br&gt; Consectetuer
>> As you can see, the entities and <br> tags are escaped and part of the
>> text node.
>> I cannot change this data source component, therefore I need a
>> transformer to examine every text node in the stream, split it at the
>> fake "<br>" tags, substitute them with <xhtml:br/> elements, and
>> replace every escaped entity with the relevant Unicode character.
> That's one of the rare cases where I consider <xsl:text
> disable-output-escaping="yes"> a valid approach [1]. I don't know if there is
> something comparable directly on the Java side.

Unless I'm mistaken, doing that on his example would produce a document that
isn't well-formed, as there's no matching [/br] closing tag...?  It would be
okay if the included text is guaranteed to be well-formed XHTML, but if
it's plain old HTML then it sounds to me more like a job for the JTidy- or
NekoHTML-based HTML transformers.

We have something similar in our application; I arrange the early part of the 
pipeline so that the escaped HTML appears within a unique element e.g.

[some_escaped_html]Lorem ipsum &lt;br&gt; dolor[/some_escaped_html]

, pass it through the html transformer

[map:transform type="html"]
  [map:parameter name="tags" value="some_escaped_html"/]
[/map:transform]

and follow that by a small xsl transformation to strip out the some_escaped_html
elements (and the html & body elements that JTidy inserts)

[xsl:template match="some_escaped_html"]
  [xsl:apply-templates select="html/body/*"/]
[/xsl:template]
+ the usual "passthrough" templates for all other nodes.

Net result, the same SAX stream but with the HTML unescaped and cleaned
up so it's well-formed again.
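
For reference, here is a minimal sketch of what the complete clean-up
stylesheet might look like, written with real angle brackets rather than the
square-bracket workaround. It is an illustration under assumptions, not the
exact stylesheet from our application: it assumes the wrapper element is
called some_escaped_html, that the html and body elements coming out of the
HTML transformer are in no namespace, and it selects html/body/node() rather
than html/body/* so that any bare text sitting directly under body is kept
as well.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- "Passthrough" (identity) template: copy every other node and
       attribute through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Drop the wrapper element, together with the html and body elements
       inserted by the HTML transformer, and keep only the body content.
       Assumption: html and body carry no namespace; if they arrive in the
       XHTML namespace, the select needs a matching prefix. -->
  <xsl:template match="some_escaped_html">
    <xsl:apply-templates select="html/body/node()"/>
  </xsl:template>

</xsl:stylesheet>
```

The more specific some_escaped_html template takes priority over the
identity template for that element, so everything else in the stream is
left as it was.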

