tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2311) Preserve "x-tika-ooxml" mime value for truncated ooxml files
Date Thu, 13 Apr 2017 19:03:41 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968088#comment-15968088 ]

Hudson commented on TIKA-2311:

SUCCESS: Integrated in Jenkins build tika-2.x #243 (See [https://builds.apache.org/job/tika-2.x/243/])
TIKA-2311 -- maintain x-tika-ooxml mime type for truncated ooxml (tallison: rev 143efc8d92735099f5077956d8f257aad106321a)
* (edit) tika-app/src/test/java/org/apache/tika/parser/pkg/PackageTest.java
* (edit) tika-parser-modules/tika-parser-package-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
* (add) tika-test-resources/src/test/resources/test-documents/testWORD_truncated.docx
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* (edit) tika-parser-modules/tika-parser-package-module/src/test/java/org/apache/tika/parser/pkg/TarParserTest.java

> Preserve "x-tika-ooxml" mime value for truncated ooxml files
> ------------------------------------------------------------
>                 Key: TIKA-2311
>                 URL: https://issues.apache.org/jira/browse/TIKA-2311
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>             Fix For: 2.0, 1.15
> The following is an unintended consequence of TIKA-2212.
> The OOXML parser used to handle {{x-tika-ooxml}}. We have some truncated ooxml files
> in our regression corpus. The previous behavior was:
> 1) ZipPackage detector caught the zip truncation exception and returned "application/zip"
> 2) The mime detector recognized magic and returned {{x-tika-ooxml}}
> 3) The file was then routed to the OOXML parser, which didn't wind up doing much with
> the content because it hit the zip exception early on, but the final mime type was
> {{x-tika-ooxml}}.
> The current behavior:
> 1) Same detection steps
> 2) However, because the OOXML parser no longer handles {{x-tika-ooxml}}, the file is
> handled by the Package Parser, which overwrites the magic-determined mime type, and the
> new mime type is {{application/zip}}.
> 3) Some content is extracted because the Package parser handles the zip entries in order
> and only throws the exception once it hits the last entry in the zip file.
> Ideally, I'd like to keep the magic-determined mime detection. Once we can chain parsers,
> the user should be able to back off to the PackageParser, but I don't think this should be
> the default behavior.
> One solution would be to create a new mime type that is not the parent of the other ooxml
> subtypes, but is itself a leaf subtype, something like: {{x-tika-ooxml-unk}}.
> Any objections/other recommendations?
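
The behavior the description asks for, keeping the magic-determined type instead of letting a container parser overwrite it with the generic zip type, can be sketched roughly as follows. This is a minimal illustration using only the JDK, not the code from the commit above. The class name, the method name, and the full "application/x-tika-ooxml" string are assumptions made for the example.

```java
import java.util.Objects;

/**
 * Illustrative sketch only; this is NOT Tika's PackageParser code.
 * The class and method names are hypothetical.
 */
public class MimePreservationSketch {

    // Assumed full names of the two mime types discussed in the issue.
    public static final String OOXML = "application/x-tika-ooxml";
    public static final String ZIP = "application/zip";

    /**
     * Decide which mime type to report after a container parser has run.
     *
     * @param detected   type found by magic detection before parsing (may be null)
     * @param parserType type the container parser would otherwise write
     * @return the type to keep in the metadata
     */
    public static String resolveFinalType(String detected, String parserType) {
        Objects.requireNonNull(parserType, "parserType");
        // Keep the more specific magic-determined type rather than
        // overwriting it with the generic container type.
        if (OOXML.equals(detected) && ZIP.equals(parserType)) {
            return detected;
        }
        // In every other case, report what the parser determined.
        return parserType;
    }

    public static void main(String[] args) {
        // Truncated ooxml file: magic said ooxml, parser would say zip.
        System.out.println(resolveFinalType(OOXML, ZIP));
        // Plain zip file with no prior detection result.
        System.out.println(resolveFinalType(null, ZIP));
    }
}
```

In this sketch the check is deliberately narrow: only the ooxml-over-zip case is preserved, so any other type the parser reports passes through unchanged.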

This message was sent by Atlassian JIRA
