tika-dev mailing list archives

From "Tilman Hausherr (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8
Date Wed, 15 Oct 2014 21:28:33 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172978#comment-14172978 ]

Tilman Hausherr commented on TIKA-1442:
---------------------------------------

files that yield only junk text in AR (Adobe Reader):

661/661834.pdf
565/565010.pdf
248/248787.pdf
979/979474.pdf
831/831528.pdf
638/638488.pdf
878/878499.pdf
503/503035.pdf
289/289669.pdf

file that has a possible virus:
345/345947.pdf (wasn't in the last test set)

files that have an error when opening with AR (although they can be displayed):
092/092919.pdf
435/435321.pdf
995/995773.pdf
078/078278.pdf
210/210260.pdf
219/219789.pdf
230/230877.pdf
268/268554.pdf
367/367594.pdf
392/392154.pdf
475/475121.pdf
477/477047.pdf
551/551464.pdf
615/615614.pdf
707/707505.pdf
714/714002.pdf
738/738627.pdf
819/819127.pdf
101/101819.pdf
359/359872.pdf
523/523690.pdf

Surprisingly, some files with LZW errors display in AR without an error message. Either
AR keeps quiet about it, or there is still a bug in the LZW decoder. Both are possible:
AR doesn't show every error, and the PDFBox LZW decoder is [tricky|https://issues.apache.org/jira/issues/?jql=labels%20%3D%20LZW].

> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.7
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 1.8.8 as soon
> as it is ready.  I'm tempted to call this a blocker on Tika 1.7.  Let's use this issue to
> carry on the discussion of regression testing (if any further discussion is necessary) or
> any other prep that needs to happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
