tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (TIKA-257) Uncorrect mime-type detection for ooxml
Date Mon, 13 Jul 2009 20:49:14 GMT

     [ https://issues.apache.org/jira/browse/TIKA-257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-257.

       Resolution: Fixed
    Fix Version/s: 0.4
         Assignee: Jukka Zitting

I found a pretty accurate magic byte pattern (the file name string [Content_Types].xml at
offset 30) for OOXML files. This still doesn't tell whether the document is a spreadsheet,
a presentation or something different, but at least it's enough to allow Tika to correctly
send the document to OOXMLParser for more detailed processing with POI.
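
For illustration, the check described above can be sketched in plain Java. This is only a rough sketch and not the code committed to Tika; the class and method names (OoxmlMagicSketch, looksLikeOoxml) are made up for this example. It relies on the fact that the fixed part of a ZIP local file header is 30 bytes long, so the name of the first entry in the archive starts at offset 30. The sketch looks only at that name and does not verify the ZIP signature itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: a sketch of the magic byte check
// described in the message above.
public class OoxmlMagicSketch {

    // File name expected as the first ZIP entry of an OOXML package.
    private static final byte[] MAGIC =
            "[Content_Types].xml".getBytes(StandardCharsets.US_ASCII);

    // The fixed portion of a ZIP local file header is 30 bytes long,
    // so the first entry's file name begins at this offset.
    private static final int MAGIC_OFFSET = 30;

    // Returns true if the given leading bytes of a stream carry the
    // "[Content_Types].xml" name at offset 30.
    static boolean looksLikeOoxml(byte[] prefix) {
        int end = MAGIC_OFFSET + MAGIC.length;
        if (prefix == null || prefix.length < end) {
            return false; // too short to contain the pattern
        }
        byte[] candidate = Arrays.copyOfRange(prefix, MAGIC_OFFSET, end);
        return Arrays.equals(candidate, MAGIC);
    }

    public static void main(String[] args) {
        // Synthetic prefix: 30 placeholder header bytes followed by the name.
        byte[] sample = new byte[MAGIC_OFFSET + MAGIC.length];
        System.arraycopy(MAGIC, 0, sample, MAGIC_OFFSET, MAGIC.length);
        System.out.println("synthetic OOXML prefix: " + looksLikeOoxml(sample));

        // A prefix of zero bytes does not carry the pattern.
        System.out.println("all-zero prefix: " + looksLikeOoxml(new byte[64]));
    }
}
```

Because the check needs only the first 49 bytes of the input, it works on a stream with no file name attached, which is the case reported in this issue.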

I added the byte pattern and made some related adjustments in revision 793696. The above test
case now passes.

Resolving as Fixed.

> Uncorrect mime-type detection for ooxml
> ---------------------------------------
>                 Key: TIKA-257
>                 URL: https://issues.apache.org/jira/browse/TIKA-257
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 0.4
>            Reporter: Maxim Valyanskiy
>            Assignee: Jukka Zitting
>             Fix For: 0.4
> MimeTypes detects docx (and other office XML documents) as 'application/zip' when file
> does not have proper extension:
> $ java -jar tika-app/target/tika-app-0.4-SNAPSHOT.jar -m /home/maxcom/download-tmp/proto.docx
> Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
> resourceName: proto.docx
> $ cat /home/maxcom/download-tmp/proto.docx | java -jar tika-app/target/tika-app-0.4-SNAPSHOT.jar
> Content-Type: application/zip
> This breaks text extraction when filename is not known

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
