tika-dev mailing list archives

From "Luke Butters (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.
Date Mon, 07 Oct 2019 21:42:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946270#comment-16946270 ]

Luke Butters edited comment on TIKA-2955 at 10/7/19 9:41 PM:
-------------------------------------------------------------

So [wikipedia Valid_characters_in_XML|https://en.wikipedia.org/wiki/Valid_characters_in_XML]
says that for XML 1.0 this range is valid:
{quote}
    U+0020–U+D7FF, U+E000–U+FFFD: this excludes some (not all) non-characters in the BMP
(all surrogates, U+FFFE and U+FFFF are forbidden);
{quote}
It goes on to say:
{quote}
The preceding code points ranges contain the following controls which are only valid in certain
contexts in XML 1.0 documents, and whose usage is restricted and highly discouraged:
    U+007F–U+0084, U+0086–U+009F: this includes a C0 control character and all but one
C1 control.
{quote}
I think most of that range is allowed in XML, although discouraged.
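To make the XML side concrete, here is a minimal Java sketch of the XML 1.0 {{Char}} production together with the discouraged range quoted above. The class and method names are made up for illustration and are not Tika or Saxon APIs.

```java
// Sketch only: class and method names are hypothetical, chosen for this example.
final class Xml10Chars {

    // XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // The "restricted and highly discouraged" controls from the quote:
    // U+007F-U+0084 and U+0086-U+009F (U+0085, NEL, is the one C1 exception).
    static boolean isDiscouragedControl(int cp) {
        return (cp >= 0x7F && cp <= 0x84) || (cp >= 0x86 && cp <= 0x9F);
    }

    public static void main(String[] args) {
        int cp = 0x93; // decimal 147, the character reported in this issue
        System.out.println(isXml10Char(cp));          // true
        System.out.println(isDiscouragedControl(cp)); // true
        System.out.println(isXml10Char(0x0B));        // false
    }
}
```

By this check, 0x93 is a legal XML 1.0 character that merely falls in the discouraged range.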

Going over to https://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#character-references
it says:
{quote}
The numeric character reference forms described above are allowed to reference any Unicode
code point other than U+0000, U+000D, permanently undefined Unicode characters (noncharacters),
and control characters other than space characters.
{quote}
I think it is saying that control characters (other than space characters) cannot be written as numeric character references.

Looking at: ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
{code}
007F;<control>;Cc;0;BN;;;;;N;DELETE;;;;
0080;<control>;Cc;0;BN;;;;;N;;;;;
0081;<control>;Cc;0;BN;;;;;N;;;;;
0082;<control>;Cc;0;BN;;;;;N;BREAK PERMITTED HERE;;;;
0083;<control>;Cc;0;BN;;;;;N;NO BREAK HERE;;;;
0084;<control>;Cc;0;BN;;;;;N;;;;;
0085;<control>;Cc;0;B;;;;;N;NEXT LINE (NEL);;;;
0086;<control>;Cc;0;BN;;;;;N;START OF SELECTED AREA;;;;
0087;<control>;Cc;0;BN;;;;;N;END OF SELECTED AREA;;;;
0088;<control>;Cc;0;BN;;;;;N;CHARACTER TABULATION SET;;;;
0089;<control>;Cc;0;BN;;;;;N;CHARACTER TABULATION WITH JUSTIFICATION;;;;
008A;<control>;Cc;0;BN;;;;;N;LINE TABULATION SET;;;;
008B;<control>;Cc;0;BN;;;;;N;PARTIAL LINE FORWARD;;;;
008C;<control>;Cc;0;BN;;;;;N;PARTIAL LINE BACKWARD;;;;
008D;<control>;Cc;0;BN;;;;;N;REVERSE LINE FEED;;;;
008E;<control>;Cc;0;BN;;;;;N;SINGLE SHIFT TWO;;;;
008F;<control>;Cc;0;BN;;;;;N;SINGLE SHIFT THREE;;;;
0090;<control>;Cc;0;BN;;;;;N;DEVICE CONTROL STRING;;;;
0091;<control>;Cc;0;BN;;;;;N;PRIVATE USE ONE;;;;
0092;<control>;Cc;0;BN;;;;;N;PRIVATE USE TWO;;;;
0093;<control>;Cc;0;BN;;;;;N;SET TRANSMIT STATE;;;;
0094;<control>;Cc;0;BN;;;;;N;CANCEL CHARACTER;;;;
0095;<control>;Cc;0;BN;;;;;N;MESSAGE WAITING;;;;
0096;<control>;Cc;0;BN;;;;;N;START OF GUARDED AREA;;;;
0097;<control>;Cc;0;BN;;;;;N;END OF GUARDED AREA;;;;
0098;<control>;Cc;0;BN;;;;;N;START OF STRING;;;;
0099;<control>;Cc;0;BN;;;;;N;;;;;
009A;<control>;Cc;0;BN;;;;;N;SINGLE CHARACTER INTRODUCER;;;;
009B;<control>;Cc;0;BN;;;;;N;CONTROL SEQUENCE INTRODUCER;;;;
009C;<control>;Cc;0;BN;;;;;N;STRING TERMINATOR;;;;
009D;<control>;Cc;0;BN;;;;;N;OPERATING SYSTEM COMMAND;;;;
009E;<control>;Cc;0;BN;;;;;N;PRIVACY MESSAGE;;;;
009F;<control>;Cc;0;BN;;;;;N;APPLICATION PROGRAM COMMAND;;;;
{code}

I then remembered that https://validator.w3.org/nu/#textarea exists and tried out {{&#x7F;}}.
The validator did not like that and said:
{code}
Character reference expands to a control character (U+007f).
{code}

So I think it is invalid only in HTML but OK in XML.
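The HTML side could be sketched like this in Java. It is only my reading of the control-character clause of the rule quoted above, using {{Character.getType}} for the Unicode {{Cc}} category. The class and method names are hypothetical, and the other clauses of the rule (U+000D, noncharacters) are deliberately left out.

```java
// Sketch only: names are hypothetical; this covers just the
// "control characters other than space characters" clause.
final class HtmlControlChars {

    // HTML5 "space characters": space, tab, LF, FF and CR.
    static boolean isHtmlSpace(int cp) {
        return cp == 0x20 || cp == 0x09 || cp == 0x0A || cp == 0x0C || cp == 0x0D;
    }

    // True for code points in Unicode general category Cc that are not
    // HTML space characters, e.g. U+007F and the C1 range U+0080-U+009F.
    static boolean isControlRejectedInHtml(int cp) {
        return Character.getType(cp) == Character.CONTROL && !isHtmlSpace(cp);
    }

    public static void main(String[] args) {
        System.out.println(isControlRejectedInHtml(0x7F)); // true
        System.out.println(isControlRejectedInHtml(0x93)); // true
        System.out.println(isControlRejectedInHtml(0x09)); // false
    }
}
```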

Should I make a pull request against version 2 or against the latest 1.x branch?


> PDF parsing to XHTML results in tika attempting to write invalid HTML characters.
> ---------------------------------------------------------------------------------
>
>                 Key: TIKA-2955
>                 URL: https://issues.apache.org/jira/browse/TIKA-2955
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Luke Butters
>            Priority: Major
>         Attachments: 314.pdf
>
>
> Hi, I am trying to parse: [^314.pdf]
> When I try to convert it to XHTML, my XML parser fails because:
> {code}
> 14:35:12.876 [main] ERROR com.funnelback.common.filter.TikaFilterProvider - Unable to filter stream with document type '.pdf'
> org.xml.sax.SAXException: net.sf.saxon.trans.XPathException: Illegal HTML character: decimal 147
>  at net.sf.saxon.event.ReceivingContentHandler.endElement(ReceivingContentHandler.java:538) ~[Saxon-HE-9.9.0-2.jar:?]
>  at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:274) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:229) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:556) ~[tika-parsers-1.19.1.jar:1.19.1]
>  at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:267) ~[pdfbox-2.0.12.jar:2.0.12]
>  at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117) ~[tika-parsers-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172) ~[tika-parsers-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[tika-core-1.19.1.jar:1.19.1]
>  at 
> [removed section of trace]
> Caused by: net.sf.saxon.trans.XPathException: Illegal HTML character: decimal 147
>  at net.sf.saxon.serialize.HTMLEmitter.writeEscape(HTMLEmitter.java:379) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.serialize.XMLEmitter.characters(XMLEmitter.java:662) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.serialize.HTMLEmitter.characters(HTMLEmitter.java:441) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.serialize.HTMLIndenter.characters(HTMLIndenter.java:216) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ProxyReceiver.characters(ProxyReceiver.java:193) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ProxyReceiver.characters(ProxyReceiver.java:193) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ProxyReceiver.characters(ProxyReceiver.java:193) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.SequenceNormalizer.characters(SequenceNormalizer.java:183) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ReceivingContentHandler.flush(ReceivingContentHandler.java:646) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ReceivingContentHandler.endElement(ReceivingContentHandler.java:526) ~[Saxon-HE-9.9.0-2.jar:?]
>  ... 43 more
> {code}
> It looks like Tika is asking the XML library to handle character 147 (i.e. 0x93), which is not allowed in HTML.
> The Saxon XML library is not happy with that. I think the default Java one doesn't complain when given the invalid character; however, Tika is probably wrong to write out that character when writing XHTML.
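
As a possible workaround on the caller's side, the SAX events could pass through a filter that replaces DEL and the C1 controls before they reach an HTML serializer. The sketch below uses only the JDK's {{org.xml.sax.helpers.XMLFilterImpl}}. The class name is hypothetical, and replacing with U+FFFD (rather than dropping the character or using a space) is an assumption. It is not what Tika does, and not necessarily how the issue should be fixed.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter, not part of Tika or Saxon. XMLFilterImpl forwards all
// other SAX events unchanged to the handler set via setContentHandler().
final class ControlCharReplacingFilter extends XMLFilterImpl {

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Work on a copy so the caller's buffer is left untouched.
        char[] copy = new char[length];
        for (int i = 0; i < length; i++) {
            char c = ch[start + i];
            // U+007F-U+009F: DEL plus the C1 controls (includes 0x93, decimal 147).
            boolean delOrC1 = c >= 0x7F && c <= 0x9F;
            copy[i] = delOrC1 ? '\uFFFD' : c; // assumption: substitute U+FFFD
        }
        super.characters(copy, 0, length);
    }
}
```

To use it, call {{setContentHandler(downstream)}} on the filter and hand the filter to the parser in place of the downstream handler.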



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
