tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1457) NullPointerException in tika-app, parsing PDF content
Date Mon, 27 Oct 2014 17:41:34 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185479#comment-14185479 ]

Tim Allison commented on TIKA-1457:
-----------------------------------

It might make sense to test against Tika 1.6 or even 1.7-SNAPSHOT.  You can download 1.6 from
the regular download [site|http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.6.jar].  You should be
able to get a snapshot of 1.7 [here|http://repository.apache.org/content/groups/snapshots/org/apache/tika/],
although that repository is timing out for me at the moment.

If the file works in 1.6, you'll get the fix in the next 4.x release of Solr (I think).  If
the file works in 1.7, open an issue on Solr to upgrade to 1.7 when it becomes available.

> NullPointerException in tika-app, parsing PDF content
> -----------------------------------------------------
>
>                 Key: TIKA-1457
>                 URL: https://issues.apache.org/jira/browse/TIKA-1457
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5
>         Environment: OS - Linux Centos 6.5
> Web APP - Tomcat6
> Using Solr 4.10
> Tika Jar
>           * tika-core-1.5.jar
>           * tika-parsers-1.5.jar
>           * tika-xmp-1.5.jar
>           * pdfbox-1.8.4.jar
>            Reporter: Tadeu Alves
>              Labels: bug, parser, solr, tika, text-extraction
>             Fix For: 1.6
>
>
> When I try to extract text from some PDF files with tika-app 1.5, I get the following exception:
> null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@52cfcf01
> 	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> 	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> 	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> 	at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
> 	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
> 	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
> 	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
> 	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
> 	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> 	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> 	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> 	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> 	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> 	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> 	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> 	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
> 	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
> 	at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
> 	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@52cfcf01
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> 	... 19 more
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 0
> 	at java.lang.String.charAt(String.java:658)
> 	at org.apache.pdfbox.util.DateConverter.parseDate(DateConverter.java:680)
> 	at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:808)
> 	at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:780)
> 	at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:754)
> 	at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:797)
> 	at org.apache.pdfbox.pdmodel.PDDocumentInformation.getModificationDate(PDDocumentInformation.java:232)
> 	at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:176)
> 	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:142)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	... 22 more
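
The innermost cause in the trace above is a String.charAt call with index 0 inside the PDF date conversion, which suggests that the document's modification date string was empty or shorter than expected. As an illustration only, here is a minimal Java sketch of a defensive date parser that returns an empty result instead of throwing. The class SafeDateDemo and the method parsePdfDate are hypothetical names and are not part of PDFBox or Tika. The sketch assumes the common "D:YYYYMMDDHHmmSS" layout and ignores any time zone suffix, treating the value as UTC. It is not the fix that went into Tika 1.6.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Optional;
import java.util.TimeZone;

// Hypothetical helper for illustration; not part of PDFBox or Tika.
final class SafeDateDemo {

    // Parses a PDF-style date such as "D:20141027174134Z".
    // Assumption: only the first 14 digits are used and the value is read as UTC.
    static Optional<Calendar> parsePdfDate(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String s = raw.trim();
        if (s.startsWith("D:")) {
            s = s.substring(2);
        }
        // Guard: an empty or short string is rejected here, so no
        // character access can run past the end of the string.
        if (s.length() < 14) {
            return Optional.empty();
        }
        String digits = s.substring(0, 14);
        for (int i = 0; i < digits.length(); i++) {
            char c = digits.charAt(i);
            if (c < '0' || c > '9') {
                return Optional.empty();
            }
        }
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss");
        fmt.setLenient(false); // reject values such as month 13
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
            cal.setTime(fmt.parse(digits));
            return Optional.of(cal);
        } catch (ParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // An empty date string yields an empty Optional rather than an exception.
        System.out.println(parsePdfDate("").isPresent());
        System.out.println(parsePdfDate("D:20141027174134Z").isPresent());
    }
}
```

With a guard like this, a caller extracting metadata could skip a malformed date and still return the document text, rather than failing the whole parse.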



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
