tika-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-185) XML files with (unsatisfied) SYSTEM entities can not be indexed
Date Fri, 09 Jan 2009 00:40:59 GMT

    [ https://issues.apache.org/jira/browse/TIKA-185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662195#action_12662195
] 

Chris A. Mattmann commented on TIKA-185:
----------------------------------------

+1.

I think we open up a whole horde of other tangential issues if we are dealing with external
validation and protocols. Let's apply KISS principles here. Perhaps this can be attacked at
the protocol layer, in some other fashion?

Cheers, 
Chris


> XML files with (unsatisfied) SYSTEM entities can not be indexed
> ---------------------------------------------------------------
>
>                 Key: TIKA-185
>                 URL: https://issues.apache.org/jira/browse/TIKA-185
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.2
>            Reporter: Andrzej Rusin
>            Priority: Minor
>         Attachments: xmlTest.xml, xmlTest2.xml
>
>
> When trying to extract an XPI file (Firefox extenstion, which probably is not a best
candidate for extract) I got the below exception.
> It was caused by SYSTEM entities refering the chrome:// protocol.
> However, obviously any XML file that contains SYSTEM entities which can not be accessed
at the time of extraction will not be extracted properly.
> Here is the stack trace:
> java.net.MalformedURLException: unknown protocol: chrome
>    at java.net.URL.<init>(URL.java:574)
>    at java.net.URL.<init>(URL.java:464)
>    at java.net.URL.<init>(URL.java:413)
>    at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
>    at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
>    at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
>    at org.apache.xerces.impl.XMLDTDScannerImpl.startPE(Unknown Source)
>    at org.apache.xerces.impl.XMLDTDScannerImpl.skipSeparator(Unknown Source)
>    at org.apache.xerces.impl.XMLDTDScannerImpl.scanDecls(Unknown Source)
>    at org.apache.xerces.impl.XMLDTDScannerImpl.scanDTDInternalSubset(Unknown Source)
>    at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
>    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:57)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)
>    at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:93)
>    at org.apache.tika.parser.pkg.ZipParser.parse(ZipParser.java:56)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message