tika-dev mailing list archives

From "Bastian Mathes (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-778) NullPointerException in tika-app, parsing PDF content
Date Tue, 15 Nov 2011 16:06:51 GMT

    [ https://issues.apache.org/jira/browse/TIKA-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150570#comment-13150570 ]

Bastian Mathes commented on TIKA-778:
-------------------------------------

Calling the extraction directly on the command line actually works (with or without --html),
so the issue is probably not as important as I thought. It is only opening the file from
within the Tika application that causes this exception (in 1.0, not in 0.10). I am sending you a PDF
by mail.
                
> NullPointerException in tika-app, parsing PDF content
> -----------------------------------------------------
>
>                 Key: TIKA-778
>                 URL: https://issues.apache.org/jira/browse/TIKA-778
>             Project: Tika
>          Issue Type: Bug
>          Components: gui, parser
>    Affects Versions: 1.0
>            Reporter: Bastian Mathes
>
> I am trying to extract text from some PDF files with the Tika app. In version 0.10 the error
> ERROR - Error: Could not parse predefined CMAP file for '--UCS2'
> is printed on the command line, but text extraction works and is correct.
> In version 1.0 I get the same error message on the command line, but I also receive an
> exception and no text is extracted:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@62bc36ff
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 	at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
> 	at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
> 	at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:238)
> 	at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1995)
> 	at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2318)
> 	at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
> 	at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
> 	at javax.swing.AbstractButton.doClick(AbstractButton.java:357)
> 	at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:809)
> 	at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:850)
> 	at java.awt.Component.processMouseEvent(Component.java:6288)
> 	at javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
> 	at java.awt.Component.processEvent(Component.java:6053)
> 	at java.awt.Container.processEvent(Container.java:2041)
> 	at java.awt.Component.dispatchEventImpl(Component.java:4651)
> 	at java.awt.Container.dispatchEventImpl(Container.java:2099)
> 	at java.awt.Component.dispatchEvent(Component.java:4481)
> 	at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4577)
> 	at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4238)
> 	at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4168)
> 	at java.awt.Container.dispatchEventImpl(Container.java:2085)
> 	at java.awt.Window.dispatchEventImpl(Window.java:2478)
> 	at java.awt.Component.dispatchEvent(Component.java:4481)
> 	at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:643)
> 	at java.awt.EventQueue.access$000(EventQueue.java:84)
> 	at java.awt.EventQueue$1.run(EventQueue.java:602)
> 	at java.awt.EventQueue$1.run(EventQueue.java:600)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
> 	at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:98)
> 	at java.awt.EventQueue$2.run(EventQueue.java:616)
> 	at java.awt.EventQueue$2.run(EventQueue.java:614)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
> 	at java.awt.EventQueue.dispatchEvent(EventQueue.java:613)
> 	at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
> 	at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
> 	at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:174)
> 	at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:169)
> 	at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:161)
> 	at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)
> Caused by: java.lang.NullPointerException
> 	at com.sun.org.apache.xml.internal.serializer.ToHTMLStream.endElement(ToHTMLStream.java:907)
> 	at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endElement(TransformerHandlerImpl.java:273)
> 	at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
> 	at org.apache.tika.gui.TikaGUI$2.endElement(TikaGUI.java:519)
> 	at org.apache.tika.sax.TeeContentHandler.endElement(TeeContentHandler.java:94)
> 	at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
> 	at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256)
> 	at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
> 	at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
> 	at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
> 	at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:273)
> 	at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:216)
> 	at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:112)
> 	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:323)
> 	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
> 	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:96)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	... 43 more
> I tried the same PDF files with both versions (I can switch back and forth between 0.10 and 1.0,
> and this behavior is stable), and it looks like the exact same PDFBox version is inside tika-app-0.10.jar
> and tika-app-1.0.jar. It would be great if version 1.0 could do what 0.10 can. Sorry that
> I cannot provide the PDF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
