tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-267) encrypted pdf files aren't handled properly
Date Thu, 09 Dec 2010 02:06:02 GMT

    [ https://issues.apache.org/jira/browse/TIKA-267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969608#action_12969608 ]

Nick Burch commented on TIKA-267:
---------------------------------

The issue with the canExtractContent check is that, when it causes decryption to be skipped,
you will often end up with garbage metadata (e.g. TIKA-389).

I think we want to always decrypt if we can, and that certainly fixes things for the test
documents I have. However, if you have a document that this change breaks, could you please
upload it? (At the moment we don't have a unit test for your use case, so I can't be sure
what effect this change will have.)
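
For comparison, here is a minimal sketch of the two policies under discussion: decrypting
only when content extraction is denied (as proposed in the issue) versus always attempting
decryption with the empty password. SketchDocument, FakeDocument and DecryptPolicySketch are
hypothetical names made up for this illustration. They are not PDFBox or Tika types, and
this is not the actual PDFParser code.

```java
/** Hypothetical stand-in for a PDF document; NOT a PDFBox or Tika type. */
interface SketchDocument {
    boolean isEncrypted();
    boolean canExtractContent();
    void decrypt(String password) throws Exception;
}

/** In-memory fake, used only to exercise the two policies below. */
final class FakeDocument implements SketchDocument {
    private boolean encrypted;
    private final boolean extractable;
    private final String userPassword;

    FakeDocument(boolean encrypted, boolean extractable, String userPassword) {
        this.encrypted = encrypted;
        this.extractable = extractable;
        this.userPassword = userPassword;
    }

    @Override public boolean isEncrypted() { return encrypted; }

    @Override public boolean canExtractContent() { return extractable; }

    @Override public void decrypt(String password) throws Exception {
        if (!userPassword.equals(password)) {
            throw new Exception("wrong password");
        }
        encrypted = false;
    }
}

final class DecryptPolicySketch {
    private DecryptPolicySketch() {}

    /** Policy from the issue: decrypt only if copying is not allowed. */
    static boolean decryptOnlyIfCopyingDenied(SketchDocument doc) {
        if (!doc.canExtractContent() && doc.isEncrypted()) {
            tryEmptyPassword(doc);
        }
        // true means the document is in the clear after this call
        return !doc.isEncrypted();
    }

    /** Policy from the comment: always decrypt if we can. */
    static boolean alwaysDecryptIfPossible(SketchDocument doc) {
        if (doc.isEncrypted()) {
            tryEmptyPassword(doc);
        }
        // true means the document is in the clear after this call
        return !doc.isEncrypted();
    }

    private static void tryEmptyPassword(SketchDocument doc) {
        try {
            doc.decrypt("");
        } catch (Exception e) {
            // Ignore: the document needs a real password and stays encrypted
        }
    }
}
```

With a fake document that is encrypted under the empty password and permits extraction, the
first policy leaves it encrypted while the second decrypts it. That is the case where the
metadata difference would be expected to show up.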

> encrypted pdf files aren't handled properly
> -------------------------------------------
>
>                 Key: TIKA-267
>                 URL: https://issues.apache.org/jira/browse/TIKA-267
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.4
>         Environment: Ubuntu Linux 8.10, JRE 1.5
>            Reporter: Sascha Szott
>            Assignee: Jukka Zitting
>            Priority: Critical
>             Fix For: 0.5
>
>   Original Estimate: 0.08h
>  Remaining Estimate: 0.08h
>
> While working on extracting full text from a batch of pdf documents, I noticed odd
> behaviour in Tika when processing encrypted documents (those that restrict specific
> actions, e.g. editing or printing). Extracting content from an encrypted pdf document does
> not always require decrypting it. For instance, when creating an (encrypted) pdf document,
> the author can choose to allow content extraction without a password. Unfortunately, Tika's
> pdf parser isn't aware of this at the moment. Therefore, I suggest a minor change inside the
> parse method of class org.apache.tika.parser.pdf.PDFParser: introduce an additional check
> ("is copying allowed") before trying to decrypt the document.
> To be more precise, I'll provide a code snippet:
> public void parse(...) throws ... {
>   PDDocument pdfDocument = PDDocument.load(stream);
>   try {
>     //decrypt document only if copying is not allowed
>     if (!pdfDocument.getCurrentAccessPermission().canExtractContent()) {
>       if (pdfDocument.isEncrypted()) {
>         try {
>           pdfDocument.decrypt("");
>         } catch (Exception e) {
>           // Ignore
>         }
>       }
>     }
>     ...
> Another solution to this problem would be to eliminate the "isEncrypted" check, since
> PDFBox seems to handle the extraction of content out of encrypted documents correctly (and
> throws an IOException in case of failure).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

