tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8
Date Fri, 24 Oct 2014 19:23:34 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183338#comment-14183338 ]

Tim Allison edited comment on TIKA-1442 at 10/24/14 7:22 PM:
-------------------------------------------------------------

Hmmm...I can't explain those files, and I recently did some cleanup so I don't have the original
1.8.6 output.  When I recently reran with the latest Tika trunk, I got the same number of
metadata values for those files with PDFBox 1.8.6 and 1.8.8-SNAPSHOT (vintage 2 days ago).
All the problematic files have attachments.

I wonder if recent work on the OCR parser could explain this. [~tpalsulich], over the last
few weeks, was there a time when we were extracting metadata from images, but now we're not?

For 224644.pdf, for example, there doesn't seem to be much metadata for the jpgs now...a total
of 40 metadata values for the full document.  Last week, when I ran Tika, there were 160
metadata values.
{noformat}
{"Content-Length":"5970","Content-Type":"image/jpeg",
"X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.ocr.TesseractOCRParser"],
"embeddedResourceType":"ATTACHMENT","resourceName":"arrow.jpg",
"tika:embedded_resource_path":"224644.pdf/arrow.jpg"},
{"Content-Length":"5970","Content-Type":"image/jpeg",
"X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.ocr.TesseractOCRParser"],
"embeddedResourceType":"ATTACHMENT","resourceName":"arrow.jpg",
"tika:embedded_resource_path":"224644.pdf/arrow.jpg"}]
{noformat} 
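
For anyone who wants to reproduce the per-file counts, here is a minimal sketch of how the metadata values could be tallied from JSON output like the excerpt above. It is not the script behind the attached spreadsheets. It assumes the output is a JSON array of metadata objects, that a multi-valued key (such as X-Parsed-By) counts once per value, and that the excerpt can be wrapped in an opening "[" to make it parseable. The function name count_metadata_values is hypothetical.

```python
import json


def count_metadata_values(metadata_list):
    """Count metadata values across a list of metadata objects.

    Assumption: a key holding a list contributes one count per element,
    and a key holding a single value contributes one count.
    """
    total = 0
    for metadata in metadata_list:
        for value in metadata.values():
            total += len(value) if isinstance(value, list) else 1
    return total


# The two embedded-image entries from the excerpt above, wrapped in "["
# (an assumption; the excerpt shows only the tail of the full array).
sample = json.loads("""
[{"Content-Length":"5970","Content-Type":"image/jpeg",
"X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.ocr.TesseractOCRParser"],
"embeddedResourceType":"ATTACHMENT","resourceName":"arrow.jpg",
"tika:embedded_resource_path":"224644.pdf/arrow.jpg"},
{"Content-Length":"5970","Content-Type":"image/jpeg",
"X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.ocr.TesseractOCRParser"],
"embeddedResourceType":"ATTACHMENT","resourceName":"arrow.jpg",
"tika:embedded_resource_path":"224644.pdf/arrow.jpg"}]
""")

print(count_metadata_values(sample))  # 14 for these two entries
```

Running the same count over the output of two Tika builds for one file would show whether the drop comes from the embedded images or from the container document.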

In short, [~tilman], I don't think this is a PDFBox issue.



> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.7
>
>         Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 1.8.8 as soon
> as it is ready.  I'm tempted to call this a blocker on Tika 1.7.  Let's use this issue to
> carry on the discussion of regression testing (if any further discussion is necessary) or
> any other prep that needs to happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
