tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8
Date Wed, 22 Oct 2014 20:06:34 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180402#comment-14180402 ]

Tim Allison edited comment on TIKA-1442 at 10/22/14 8:05 PM:
-------------------------------------------------------------

Sorry, ran new eval code on old 1.8.8 batch process.  Will rerun batch process with latest
1.8.8.

For file 272372.pdf, I see this in the Excel file that I posted earlier today:
{noformat}
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:249)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:137)
	at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:120)
	at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:153)
	at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:96)
	at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:38)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSStream
	at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getMetadata(PDDocumentCatalog.java:312)
	at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:181)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
	... 13 more
{noformat}

Should I try to grab more than that?  Or, are you seeing the same thing that I'm seeing in
the Excel file?
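
For anyone reading the trace without the file at hand: the root cause it reports is a COSDictionary being cast to COSStream inside PDDocumentCatalog.getMetadata. The following is only a minimal sketch of that kind of failure and of one possible guard. The classes CosDictionary and CosStream and both methods are hypothetical stand-ins written for illustration; they are not PDFBox's classes and not its actual code at that line.

```java
// Hypothetical stand-ins for illustration only; NOT the real PDFBox classes.
public class MetadataCastSketch {
    // Assumption: the stream type is a subtype of the dictionary type,
    // mirroring how the names in the trace relate to each other.
    static class CosDictionary {}
    static class CosStream extends CosDictionary {}

    // Unchecked downcast: throws ClassCastException when the entry
    // is a plain dictionary rather than a stream.
    static CosStream uncheckedMetadata(CosDictionary entry) {
        return (CosStream) entry;
    }

    // Guarded variant: checks the runtime type first and returns null
    // when the entry is not a stream.
    static CosStream guardedMetadata(CosDictionary entry) {
        return (entry instanceof CosStream) ? (CosStream) entry : null;
    }

    public static void main(String[] args) {
        CosDictionary plain = new CosDictionary();
        try {
            uncheckedMetadata(plain);
        } catch (ClassCastException e) {
            System.out.println("unchecked cast failed: " + e.getClass().getSimpleName());
        }
        System.out.println("guarded result: " + guardedMetadata(plain));
    }
}
```

Whether returning null is the right behavior for a malformed metadata entry is a separate design question; the sketch only shows why the cast in the trace can fail.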


was (Author: tallison@mitre.org):
Sorry, ran new eval code on old 1.8.8 batch process.  Will rerun batch process with latest
1.8.8.

For file 27372.pdf, I see this in Excel:
{noformat}
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:249)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:137)
	at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:120)
	at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:153)
	at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:96)
	at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:38)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSStream
	at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getMetadata(PDDocumentCatalog.java:312)
	at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:181)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
	... 13 more
{noformat}

Should I try to grab more than that?  Or, are you seeing the same thing that I'm seeing in
the Excel file?

> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.7
>
>         Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 1.8.8 as soon
> as it is ready.  I'm tempted to call this a blocker on Tika 1.7.  Let's use this issue to
> carry on the discussion of regression testing (if any further discussion is necessary) or
> any other prep that needs to happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
