tika-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-651) Unescaped attribute value generated
Date Sun, 01 May 2011 08:45:03 GMT

     [ https://issues.apache.org/jira/browse/TIKA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Uwe Schindler updated TIKA-651:

    Attachment: XHTMLSerializer.java

Yes. Per the SAX specification, characters() events carry unescaped text, and the same holds
for the attribute values passed to startElement(). The content handler code that writes the
text out must therefore escape it.
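
As a rough illustration of that point (not the attached XHTMLSerializer.java), a content handler
that writes markup could escape text and attribute values along these lines. The class name
EscapingHandler and its escape() helper are made up for this sketch; only the org.xml.sax types
are real API.

```java
import java.io.IOException;
import java.io.Writer;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical minimal handler: writes the SAX events it receives as markup
// and escapes everything itself, because SAX delivers unescaped strings.
class EscapingHandler extends DefaultHandler {
    private final Writer out;

    EscapingHandler(Writer out) {
        this.out = out;
    }

    // Escapes the characters that are special in markup. Double quotes are
    // only escaped for attribute values, which this sketch always writes
    // inside double quotes.
    static String escape(String s, boolean inAttribute) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"':
                    if (inAttribute) sb.append("&quot;"); else sb.append(c);
                    break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        try {
            out.write("<" + qName);
            for (int i = 0; i < atts.getLength(); i++) {
                // Attribute values arrive unescaped, e.g. "?a=1&b=2".
                out.write(" " + atts.getQName(i) + "=\"" + escape(atts.getValue(i), true) + "\"");
            }
            out.write(">");
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            out.write(escape(new String(ch, start, length), false));
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        try {
            out.write("</" + qName + ">");
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }
}
```

With such a handler, an href value containing "&" would be written as "&amp;", which is the
escaping the reporter's own content handler would need to perform.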

If anyone is interested, I have an XHTML-conformant content handler that also produces
HTML4-compatible XHTML (ordinary serializers produce either XML *or* HTML4). It is based on the
XALAN/XERCES serializer.jar but adds some filtering, so that the generated XHTML has correct
block tags, has a space before "/>", and always writes separate start and end tags for certain
elements such as <script></script>, even when they are empty (because only attributes are set).
It also always wraps style/script content in CDATA hidden behind a fake comment, which keeps it
compatible with HTML4, where SCRIPT/STYLE content is defined to be CDATA by default.
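
To make the empty-element rule concrete, here is a small sketch of how such a decision might be
coded. It is my own illustration under stated assumptions, not an excerpt from the attached class:
the name EmptyElementWriter is hypothetical, the attribute string is assumed to be already
serialized and escaped, and the set lists the elements that the HTML 4.01 DTD declares as EMPTY.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: decides how an element without content is written.
class EmptyElementWriter {
    // Elements declared EMPTY in the HTML 4.01 DTD; only these may be minimized.
    private static final Set<String> EMPTY_ELEMENTS = new HashSet<>(Arrays.asList(
            "area", "base", "basefont", "br", "col", "frame", "hr",
            "img", "input", "isindex", "link", "meta", "param"));

    // attrs is assumed to be an already serialized and escaped attribute
    // string such as src="a.js", or "" when there are no attributes.
    static String emptyElement(String name, String attrs) {
        String open = "<" + name + (attrs.isEmpty() ? "" : " " + attrs);
        if (EMPTY_ELEMENTS.contains(name)) {
            // Minimized form, with a space before "/>" for HTML4 browsers.
            return open + " />";
        }
        // Everything else gets an explicit end tag, even without content.
        return open + "></" + name + ">";
    }
}
```

Under these assumptions a br element is written as <br />, while a script element that only
carries a src attribute is written as <script src="a.js"></script>.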

Currently you still need to add a proper DOCTYPE yourself: the handler does not emit one
because, for our internal purposes, we also serialize fragments using that class.

I have attached this document handler; maybe it is of some use to somebody. It needs
serializer.jar from XERCES/XALAN to compile and run.

As far as I know, this is currently the only XHTML serializer that produces the kind of XHTML
generally needed for documents to behave correctly even in browsers that only support
HTML4. Its output passes both HTML4 and XHTML1 validators (of course the element data needs to be

> Unescaped attribute value generated
> -----------------------------------
>                 Key: TIKA-651
>                 URL: https://issues.apache.org/jira/browse/TIKA-651
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Raimund Merkert
>         Attachments: XHTMLSerializer.java
> I've converted a word document that contains hyperlinks with a complex query component.
> The & character is not escaped and mozilla complains about that when I write out the XHTML
> via a content handler that I wrote.
> It's not clear to me whether or not my contenthandler should assume attributes are properly
> escaped or not.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
