tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1489) PDF Text extraction without permission
Date Mon, 02 Mar 2015 16:56:05 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1489:
    Attachment: testPDF_no_extract_yes_accessibility_owner_user.pdf

This patch keeps Tika's default behavior.  Users who want to check permissions can configure
that via {{PDFParserConfig}}.
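
As a rough illustration of the opt-in behavior described above, the following self-contained
Java sketch models a parser that ignores permission flags by default and enforces them only
when configured to do so. Every name in it ({{PermissionSketch}}, {{CheckMode}}, {{Permissions}},
{{mayExtract}}, {{extract}}) is hypothetical and is not the API of {{PDFParserConfig}} or PDFBox.
The separate "extract for accessibility" flag is an assumption based on the names of the
attached test files.

```java
// Hypothetical sketch; none of these names are Tika or PDFBox APIs.
final class PermissionSketch {

    /** How strictly the (hypothetical) parser honors permission flags. */
    enum CheckMode {
        IGNORE,                       // default: extract regardless of the flags
        ENFORCE,                      // require the "extract content" permission
        ENFORCE_ALLOW_ACCESSIBILITY   // also accept "extract for accessibility"
    }

    /** The two permission flags assumed to matter for text extraction. */
    static final class Permissions {
        final boolean canExtractContent;
        final boolean canExtractForAccessibility;

        Permissions(boolean canExtractContent, boolean canExtractForAccessibility) {
            this.canExtractContent = canExtractContent;
            this.canExtractForAccessibility = canExtractForAccessibility;
        }
    }

    /** Returns true if extraction should proceed under the given mode. */
    static boolean mayExtract(Permissions p, CheckMode mode) {
        switch (mode) {
            case IGNORE:
                return true;
            case ENFORCE:
                return p.canExtractContent;
            case ENFORCE_ALLOW_ACCESSIBILITY:
                return p.canExtractContent || p.canExtractForAccessibility;
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }

    /** Returns the text, or throws if the configured check forbids extraction. */
    static String extract(String text, Permissions p, CheckMode mode) {
        if (!mayExtract(p, mode)) {
            throw new SecurityException("document does not permit text extraction");
        }
        return text;
    }
}
```

In this sketch, a caller that never sets a mode other than {{IGNORE}} sees no change in
behavior, which mirrors the intent of keeping the default as it is.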

I created a {{core}} level metadata object {{AccessPermissions}} to capture the AccessPermission
metadata.  This is primarily derived from the model for PDFs, but we can add more properties
once we start extracting this information from MSOffice documents.
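
To make the idea of a permissions metadata object more concrete, here is a minimal Java sketch
that records permission flags as string key/value pairs. The class name, the method, and the
key names are illustrative assumptions; they are not the properties defined by
{{AccessPermissions}}.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the key names below are illustrative, not Tika's.
final class PermissionMetadataSketch {

    /** Records each permission flag as a "true"/"false" metadata value. */
    static Map<String, String> toMetadata(boolean canExtractContent,
                                          boolean canExtractForAccessibility,
                                          boolean canPrint) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("permission:extract_content",
                Boolean.toString(canExtractContent));
        metadata.put("permission:extract_for_accessibility",
                Boolean.toString(canExtractForAccessibility));
        metadata.put("permission:can_print",
                Boolean.toString(canPrint));
        return metadata;
    }
}
```

Storing the flags as plain metadata values means a consumer can inspect them whether or not
the parser was configured to enforce them.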

If we find other document formats that allow "don't extract content", we can move {{AccessChecker}}
to core and perhaps move that configuration to {{TikaConfig}}, but until we find other formats
that require this, I think it is better to keep it closely tied to PDFs.  I'm open to other
ideas, though.

If there are no objections, I'll commit this in a few days.

I am extremely grateful to [~tilman] for opening this issue and for his great patience in
helping me to understand PDF's access permission model.

> PDF Text extraction without permission
> --------------------------------------
>                 Key: TIKA-1489
>                 URL: https://issues.apache.org/jira/browse/TIKA-1489
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.7
>            Reporter: Tilman Hausherr
>         Attachments: TIKA-1489_v1.patch, testPDF_no_extract_no_accessibility_owner_empty.pdf,
testPDF_no_extract_no_accessibility_owner_user.pdf, testPDF_no_extract_yes_accessibility_owner_empty.pdf,
> In TIKA-1442, text extraction works on files like 717226.pdf that don't have text extraction
> permission. The permissions in PDF files are enforced only by the application (i.e.
> PDFBox); the text itself isn't stored separately in encrypted form.
> The PDFBox ExtractText command line does throw an exception.
> So I wonder why Tika is able to extract text. Either Tika or the PDFBox call used bypasses
> the permission checking.

