tika-dev mailing list archives

From "Scott Severtson (JIRA)" <j...@apache.org>
Subject [jira] Created: (TIKA-578) XMLParser ContentHandler: multiple endDocument calls
Date Wed, 22 Dec 2010 17:32:03 GMT
XMLParser ContentHandler: multiple endDocument calls
----------------------------------------------------

                 Key: TIKA-578
                 URL: https://issues.apache.org/jira/browse/TIKA-578
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.8
         Environment: N/A
            Reporter: Scott Severtson


When a ContentHandler is supplied to an XMLParser instance, the ContentHandler's .endDocument()
method is called twice: once by the SAXParser created within XMLParser, and once explicitly by
XMLParser itself.

Sample code:
---
InputStream inputStream = ...
XMLParser parser = new DcXMLParser();
ParseContext context = new ParseContext();
Metadata metadata = new Metadata();

DOMResult result = new DOMResult();
TransformerHandler transformerHandler = ((SAXTransformerFactory) SAXTransformerFactory.newInstance()).newTransformerHandler();
transformerHandler.setResult(result);

parser.parse(inputStream, transformerHandler, metadata, context);
---


The following exception is produced:
---
java.util.EmptyStackException
	at java.util.Stack.peek(Stack.java:85)
	at java.util.Stack.pop(Stack.java:67)
	at com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM.endDocument(SAX2DOM.java:143)
	at com.sun.org.apache.xml.internal.serializer.ToXMLSAXHandler.endDocument(ToXMLSAXHandler.java:181)
	at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endDocument(TransformerHandlerImpl.java:231)
	at org.apache.tika.sax.ContentHandlerDecorator.endDocument(ContentHandlerDecorator.java:115)
	at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:212)
	at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:71)
	...
---

We have worked around the issue temporarily by passing in a ContentHandler that swallows the first
.endDocument() call and allows the second to go through. However, we believe XMLParser should
suppress the extraneous .endDocument() call internally.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

