tika-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-185) XML files with (unsatisfied) SYSTEM entities can not be indexed
Date Fri, 09 Jan 2009 00:20:59 GMT

    [ https://issues.apache.org/jira/browse/TIKA-185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662187#action_12662187
] 

Jukka Zitting commented on TIKA-185:
------------------------------------

We probably should prevent Tika and the parsers it invokes from trying to access any external
entities when parsing XML (or any other file formats for that matter). The only input for
content extraction should be the input stream and input metadata passed to the parse() method.

> XML files with (unsatisfied) SYSTEM entities can not be indexed
> ---------------------------------------------------------------
>
>                 Key: TIKA-185
>                 URL: https://issues.apache.org/jira/browse/TIKA-185
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.2
>            Reporter: Andrzej Rusin
>            Priority: Minor
>         Attachments: xmlTest.xml, xmlTest2.xml
>
>
> When trying to extract an XPI file (Firefox extenstion, which probably is not a best
candidate for extract) I got the below exception.
> It was caused by SYSTEM entities refering the chrome:// protocol.
> However, obviously any XML file that contains SYSTEM entities which can not be accessed
at the time of extraction will not be extracted properly.
> Here is the stack trace:
> java.net.MalformedURLException: unknown protocol: chrome
>    at java.net.URL.<init>(URL.java:574)
>    at java.net.URL.<init>(URL.java:464)
>    at java.net.URL.<init>(URL.java:413)
>    at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
>    at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
>    at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
>    at org.apache.xerces.impl.XMLDTDScannerImpl.startPE(Unknown Source)
>    at org.apache.xerces.impl.XMLDTDScannerImpl.skipSeparator(Unknown Source)
>    at org.apache.xerces.impl.XMLDTDScannerImpl.scanDecls(Unknown Source)
>    at org.apache.xerces.impl.XMLDTDScannerImpl.scanDTDInternalSubset(Unknown Source)
>    at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
>    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:57)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)
>    at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:93)
>    at org.apache.tika.parser.pkg.ZipParser.parse(ZipParser.java:56)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message