tika-dev mailing list archives

From "Antoni Mylka (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-791) Fix the detection of protected OOXML files
Date Fri, 25 Nov 2011 15:51:40 GMT

     [ https://issues.apache.org/jira/browse/TIKA-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Antoni Mylka updated TIKA-791:

    Attachment: tika-791.zip

A ZIP file with the patch and some test documents. They differ from the ones in the
test-documents folder in that they are protected by a non-default password. The existing
protectedFile.xlsx, for instance, is protected with the default password. I made these
example files myself.
> Fix the detection of protected OOXML files
> ------------------------------------------
>                 Key: TIKA-791
>                 URL: https://issues.apache.org/jira/browse/TIKA-791
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.1
>         Environment: Windows 7 64 bit
>            Reporter: Antoni Mylka
>         Attachments: tika-791.zip
> TIKA-437 patch allowed Tika to work with OOXML files protected with the default VelvetSweatshop
> password. I feel there is room for improvement.
> # The POIFSContainerDetector lies when it sees such a file. It should be able to mark
> it as x-tika-ooxml
> # The OOXMLParser can't work with such a file. It should:
> ## If it's protected with the default password - it should be decrypted and processed
> ## If it's protected with a non-default password - the file should be marked as protected,
> no weird exceptions should appear.
> Therefore I'd like to add an 'if' to POIFSContainerDetector which returns x-tika-ooxml,
> and some code to OOXMLParser, which would be similar to the code currently residing in OfficeParser.
> After this improvement both the OfficeParser and the OOXMLParser will treat such files in
> the same way.
> When I have that, I can add a hack in my application, which will say "If the type is
> x-tika-ooxml and the name-based detection is a specialization of ooxml, then use the name-based
> detection". This will be a workaround for the fact that in MimeTypes, magic always trumps
> the name. With that, the encrypted DOCX files will appear with the normal DOCX mimetype in
> my app.
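
The application-side workaround described in the last paragraph of the issue could look
roughly like the sketch below. It is only an illustration, not code from the patch: the
class name OoxmlTypeResolver, the resolve and isSpecializationOf methods, and the
hard-coded parent table are all hypothetical, and the full type string
application/x-tika-ooxml is assumed from the x-tika-ooxml name used in the issue. A real
application would presumably ask Tika's own type registry about the hierarchy instead of
keeping a table.

```java
import java.util.Map;

// Hypothetical helper, not part of Tika: prefers the name-based type when
// content detection could only report the generic OOXML container type.
final class OoxmlTypeResolver {

    // Assumed full name of the generic type the issue calls "x-tika-ooxml".
    static final String GENERIC_OOXML = "application/x-tika-ooxml";

    // Illustrative child -> parent table; stands in for a real type hierarchy lookup.
    private static final Map<String, String> PARENT = Map.of(
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document", GENERIC_OOXML,
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", GENERIC_OOXML,
        "application/vnd.openxmlformats-officedocument.presentationml.presentation", GENERIC_OOXML);

    // True if 'type' has 'ancestor' somewhere above it in the table.
    static boolean isSpecializationOf(String type, String ancestor) {
        if (type == null) {
            return false;
        }
        String current = PARENT.get(type);
        while (current != null) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = PARENT.get(current);
        }
        return false;
    }

    // Content-based (magic) detection wins, except when it yields only the
    // generic OOXML type and the name-based type is a specialization of it.
    static String resolve(String magicType, String nameType) {
        if (GENERIC_OOXML.equals(magicType) && isSpecializationOf(nameType, GENERIC_OOXML)) {
            return nameType;
        }
        return magicType;
    }

    public static void main(String[] args) {
        // An encrypted .docx: magic sees only the generic type, the name says DOCX.
        System.out.println(resolve(GENERIC_OOXML,
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document"));
        // A file whose name suggests an unrelated type: magic still wins.
        System.out.println(resolve(GENERIC_OOXML, "text/plain"));
    }
}
```

Restricting the override to specializations of the generic type keeps a misleading file
extension from replacing a content-based result with an unrelated type.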
