tika-dev mailing list archives

From "Tilman Hausherr (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1419) Upgrade to PDFBox 1.8.7
Date Thu, 09 Oct 2014 22:35:35 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated TIKA-1419:
----------------------------------
    Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx

Thank you [~tallison@apache.org], here are the results of some manual analysis. The good news
is that I found a few improvements, only two regressions, and no cases of "smaller results"
like those seen with 1.8.7. Here are some suggestions for how the automatic analysis could be improved:

- Use a dictionary, or maybe just count a few common English words of at least three characters
( https://en.wikipedia.org/wiki/Most_common_words_in_English ), so that files whose extracted
text is mostly trash can be ignored (even though the trash itself changes between versions).
- Delete files from the test set that are known to be corrupt, or that don't yield any useful text
even in Adobe Reader, so that the manual investigation doesn't have to be repeated each time.
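
A minimal sketch of how these two suggestions might look in an analysis script. This is not part of Tika or PDFBox: the word list is a short, hand-picked sample, and the function names, the `min_hits` threshold, and the `known_bad` set are all hypothetical choices made for illustration.

```python
import re

# Hypothetical, abbreviated list of common English words with 3+ letters.
# A real run would load a longer list, e.g. from the Wikipedia page above.
COMMON_WORDS = frozenset({
    "the", "and", "that", "have", "for", "not", "with", "you", "this",
    "but", "his", "from", "they", "say", "her", "she", "will", "one",
    "all", "would", "there", "their", "what", "about", "which", "when",
})

# Only runs of at least three ASCII letters count as candidate words.
TOKEN_RE = re.compile(r"[A-Za-z]{3,}")


def common_word_count(text):
    """Count tokens of 3+ letters that appear in COMMON_WORDS (case-insensitive)."""
    return sum(1 for tok in TOKEN_RE.findall(text) if tok.lower() in COMMON_WORDS)


def looks_like_trash(text, min_hits=3):
    """Flag extracted text with fewer than min_hits common words.

    The threshold is an assumption and would need tuning on real output.
    """
    return common_word_count(text) < min_hits


def files_to_compare(extracted, known_bad=frozenset()):
    """Return file names worth comparing between two extractor versions.

    extracted maps file name -> extracted text; known_bad holds names of
    files already identified as corrupt or unreadable.
    """
    return sorted(
        name for name, text in extracted.items()
        if name not in known_bad and not looks_like_trash(text)
    )
```

For example, `files_to_compare(texts, known_bad={"corrupt.pdf"})` would drop both the listed file and any file whose extracted text contains fewer than three of the sample words, leaving a shorter list for manual review.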

I analysed only cases where there were no exceptions. Within the next few days, I'll investigate
some of the cases where there are still exceptions. However, most of these are corrupt files
that even Adobe Reader doesn't display.

> Upgrade to PDFBox 1.8.7
> -----------------------
>
>                 Key: TIKA-1419
>                 URL: https://issues.apache.org/jira/browse/TIKA-1419
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv, compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOT.zip
>
>
> Will run against govdocs1 early next week and then upgrade if no major regressions are found.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
