tika-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tilman Hausherr (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8
Date Wed, 22 Oct 2014 22:12:34 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180636#comment-14180636
] 

Tilman Hausherr commented on TIKA-1442:
---------------------------------------

Which are the top10words? I ask because 554/554384.pdf has only five of them.

I've now found a strategy... First, I've added a new column that the new word count by the
old word count. If the result is smaller than 1, treat it as suspicious - but not, if both
have zero top10words. The file I mention has 5 (0 before) so the file has improved, and it
is not a regression.

Another strategy would be to look for files with less top10words, this would likely be a regression.
Will probably add a column with a formula for that one.

> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.7
>
>         Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 1.8.8 as soon
as it is ready.  I'm tempted to call this a blocker on Tika 1.7.  Let's use this issue to
carry on the discussion of regression testing (if any further discussion is necessary) or
any other prep that needs to happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message