tika-dev mailing list archives

From "Tilman Hausherr (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8
Date Fri, 10 Oct 2014 17:48:37 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167194#comment-14167194 ]

Tilman Hausherr commented on TIKA-1442:

Do you want the junk list in a particular format? Just the six digits, or the directory too?

Because this manual checking takes a lot of time, I'm planning to download the entire directory
and store the PDFs only.
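
A minimal sketch of the "store the PDFs only" step, assuming the directory has already been downloaded locally. The function names `looks_like_pdf` and `copy_pdfs_only` and the 1024-byte probe window are illustrative assumptions, not anything from this thread; the check relies only on the `%PDF-` header marker that PDF files carry near the start.

```python
from pathlib import Path
import shutil

# Header marker found at (or near) the start of a PDF file.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path, probe_bytes=1024):
    """Return True if the PDF header marker occurs in the first probe_bytes bytes.

    The probe window is an assumption chosen to tolerate a little leading junk.
    """
    with open(path, "rb") as fh:
        return PDF_MAGIC in fh.read(probe_bytes)


def copy_pdfs_only(src_dir, dst_dir):
    """Copy files that look like PDFs from src_dir to dst_dir.

    Relative paths are preserved so the directory part of each name survives.
    Returns the list of copied target paths.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    kept = []
    for path in sorted(src.rglob("*")):
        if path.is_file() and looks_like_pdf(path):
            target = dst / path.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            kept.append(target)
    return kept
```

This checks content rather than file extension, which matters if the files are named only by number. The destination should live outside the source tree so that copies are not scanned again.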

We should agree on criteria for exclusion. Suggestion:
- files that Adobe Reader fails to display at some point (this applies to most, if not all,
of the files that throw exceptions with LZW or Flate)
- files that do display, but yield only junk when text is copied and pasted from Adobe Reader

Re 1.8.8: yes, I'm obviously unhappy with 1.8.7. But the token comparison should be improved
first, so that we're really sure there are no major regressions. (Although I'm optimistic, based
on your tests, which show only one minor regression.)
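
For illustration only, here is one way a token comparison between the text extracted by two versions could look. The thread does not describe how the actual comparison works, so everything here is an assumption: the word-based tokenization, the helper names `token_counts`, `token_diff` and `overlap_ratio`, and the overlap measure itself.

```python
import re
from collections import Counter


def token_counts(text):
    """Count lower-cased word tokens in the extracted text."""
    return Counter(re.findall(r"\w+", text.lower()))


def token_diff(old_text, new_text):
    """Return (lost, gained): tokens that occur less or more often in new_text."""
    old, new = token_counts(old_text), token_counts(new_text)
    lost = old - new      # Counter subtraction keeps only positive counts
    gained = new - old
    return lost, gained


def overlap_ratio(old_text, new_text):
    """Shared token mass divided by combined token mass (1.0 means identical bags)."""
    old, new = token_counts(old_text), token_counts(new_text)
    total = sum((old | new).values())    # per-token maximum of the two counts
    shared = sum((old & new).values())   # per-token minimum of the two counts
    return 1.0 if total == 0 else shared / total
```

A low ratio for a given file would flag it for manual inspection. A bag-of-tokens measure like this ignores word order, so it cannot detect scrambled but otherwise complete text.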

> Upgrade to PDFBox 1.8.8
> -----------------------
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.7
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 1.8.8 as soon
> as it is ready.  I'm tempted to call this a blocker on Tika 1.7.  Let's use this issue to
> carry on the discussion of regression testing (if any further discussion is necessary) or
> any other prep that needs to happen before 1.8.8's release.
