tika-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Andrzej Rusin (JIRA)" <j...@apache.org>
Subject [jira] Updated: (TIKA-185) XML files with (unsatisfied) SYSTEM entities can not be indexed
Date Thu, 08 Jan 2009 15:42:59 GMT

     [ https://issues.apache.org/jira/browse/TIKA-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Andrzej Rusin updated TIKA-185:

    Attachment: xmlTest2.xml

Minimal test case for the issue.

xmlTest.xml: when blah.xml is present at time of extracting (with any content, eg "blah"),
everything works OK, otherwise not.

xmlTest2.xml: always fails because chrome protocol is not defined. However there exist xml
files using this protocol, eg in Firefox plugins.

> XML files with (unsatisfied) SYSTEM entities can not be indexed
> ---------------------------------------------------------------
>                 Key: TIKA-185
>                 URL: https://issues.apache.org/jira/browse/TIKA-185
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.2
>            Reporter: Andrzej Rusin
>            Priority: Minor
>         Attachments: xmlTest.xml, xmlTest2.xml
> When trying to extract an XPI file (Firefox extenstion, which probably is not a best
candidate for extract) I got the below exception.
> It was caused by SYSTEM entities refering the chrome:// protocol.
> However, obviously any XML file that contains SYSTEM entities which can not be accessed
at the time of extraction will not be extracted properly.
> Here is the stack trace:
> java.net.MalformedURLException: unknown protocol: chrome
>    at java.net.URL.<init>(URL.java:574)
>    at java.net.URL.<init>(URL.java:464)
>    at java.net.URL.<init>(URL.java:413)
>    at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
>    at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
>    at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
>    at org.apache.xerces.impl.XMLDTDScannerImpl.startPE(Unknown Source)
>    at org.apache.xerces.impl.XMLDTDScannerImpl.skipSeparator(Unknown Source)
>    at org.apache.xerces.impl.XMLDTDScannerImpl.scanDecls(Unknown Source)
>    at org.apache.xerces.impl.XMLDTDScannerImpl.scanDTDInternalSubset(Unknown Source)
>    at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
>    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:57)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)
>    at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:93)
>    at org.apache.tika.parser.pkg.ZipParser.parse(ZipParser.java:56)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message