tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1292) Inconsistent priorities in bundled tika-mimetypes.xml
Date Mon, 26 May 2014 08:02:01 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008652#comment-14008652 ]

Hudson commented on TIKA-1292:

FAILURE: Integrated in tika-trunk-jdk1.7 #3 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/3/])
TIKA-1292 If there is more than one mime magic which matches at the highest priority, keep
track and then try to pick based on filename or type hint later (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1596612)
* /tika/trunk/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java
* /tika/trunk/tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
* /tika/trunk/tika-core/src/test/java/org/apache/tika/mime/MimeTypesReaderTest.java
Set an explicit priority on the OLE2 match, remove two MS Word matches which were OLE2 ones
in disguise, and add an intermediate staroffice parent on the staroffice types. Helps with
TIKA-1292 testing (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1596611)
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Add a disabled unit test for TIKA-1292, which when working will ensure that if we have two
matching magics at the same priority, the name is used to specialise if possible, first defined
if not (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1596593)
* /tika/trunk/tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
* /tika/trunk/tika-core/src/test/resources/org/apache/tika/mime/custom-mimetypes.xml
Container formats with specific, low-false-positive magic matches need a slightly higher priority,
so that they don't accidentally end up being matched based on the contents of the container
near the start of the file. Partly solves TIKA-1292. This closes #6 github pull request (nick:
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
Add some notes on entries, to help people maintaining the file know what to do, related to
TIKA-1292 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1596586)
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
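
The tie-breaking rule described in the commit messages above (collect every magic match at the highest priority, specialise by file name if possible, otherwise take the first one defined) can be sketched roughly as follows. This is a minimal illustration, not the code committed to MimeTypes.java: the class `MagicTieBreakSketch`, the `Candidate` type, the `pick` method, and the reduction of glob patterns to bare extensions are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Illustrative only: none of these names come from Tika's MimeTypes class. */
final class MagicTieBreakSketch {

    /** Hypothetical stand-in for a type whose magic matched the data. */
    static final class Candidate {
        final String type;
        final int priority;
        final String extension; // assumption: globs simplified to one bare extension

        Candidate(String type, int priority, String extension) {
            this.type = type;
            this.priority = priority;
            this.extension = extension;
        }
    }

    /**
     * Picks one type from the magic matches, which are given in definition order.
     * Only the highest-priority matches are considered; among those, a match
     * whose extension fits the file name wins, otherwise the first defined wins.
     */
    static String pick(List<Candidate> matches, String fileName) {
        if (matches.isEmpty()) {
            return "application/octet-stream"; // generic fallback for the sketch
        }
        int best = Integer.MIN_VALUE;
        for (Candidate c : matches) {
            best = Math.max(best, c.priority);
        }
        List<Candidate> top = new ArrayList<>();
        for (Candidate c : matches) {
            if (c.priority == best) {
                top.add(c);
            }
        }
        if (top.size() > 1 && fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Candidate c : top) {
                if (lower.endsWith("." + c.extension)) {
                    return c.type; // the name specialises the tie
                }
            }
        }
        return top.get(0).type; // no usable hint: first defined
    }

    public static void main(String[] args) {
        List<Candidate> matches = Arrays.asList(
                new Candidate("text/html", 40, "html"),
                new Candidate("application/zip", 40, "zip"));
        System.out.println(pick(matches, "bundle.zip")); // application/zip
        System.out.println(pick(matches, null));         // text/html
    }
}
```

With both types at priority 40, as in the issue below, the file name decides. Without a name, the result depends only on definition order, which is why the accompanying priority changes in tika-mimetypes.xml still matter.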

> Inconsistent priorities in bundled tika-mimetypes.xml
> -----------------------------------------------------
>                 Key: TIKA-1292
>                 URL: https://issues.apache.org/jira/browse/TIKA-1292
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.5
>            Reporter: Cservenak, Tamas
>             Fix For: 1.6
> It seems that mime-type priorities are a bit inconsistent in the tika-core bundled tika-mimetypes.xml.
> A few examples:
> * [application/zip|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3497]
vs [application/x-7z-compressed|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3510]:
both are similar "container" archive formats (structured, having entries) with distinct
file extensions ("zip" vs "7z" globs), yet their priorities are 40 and 50 respectively.
> * [application/zip|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3497]
vs [text/html|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L4713]:
unrelated MIME types that share the same priority of 40. But ZIP files can be "uncompressed"
(meaning entries are mostly "concatenated", and their content, if plaintext, is readable).
Hence, an "uncompressed" ZIP (or any subclass like JAR) file that contains HTML files
might be detected as HTML, which is wrong.
> And this is what happens in Nexus, which uses Tika under the hood for "content" validation,
basically using MIME magic detection provided by Tika Detector: the Java JAR {{com.intellij:annotations:7.0.3}}
([link|http://repo1.maven.org/maven2/com/intellij/annotations/7.0.3/]) is being detected as
{{text/html}} instead of (expected) {{application/java-archive}}.
> The reason is as follows: the JAR file is zipped up in "uncompressed" zip format, and among
a few annotations it also contains one HTML file entry (the license, I guess). Since both MIME
types have the same priority (40), I guess Tika "randomly" chooses {{text/html}}.
> Original Nexus issue
> https://issues.sonatype.org/browse/NEXUS-6560
> The Nexus issue links a GH Pull Request that solves the problem for us (by raising
{{application/zip}} priority to 41).
> But by inspecting the bundled tika-mimetypes.xml we spotted other -- probably -- priority
inconsistencies, like that of zip vs 7z mentioned above.
> Note: this happens when using tika-core alone on the classpath and using it for MIME magic
detection. Interestingly, when tika-parsers (with all its dependencies) is added to the
classpath, Tika will properly figure out that the artifact is {{application/java-archive}}.
Still, our use case in Nexus requires the MIME magic detection only, so we do not use tika-parsers,
nor would we like to do so.
> Sample project to reproduce
> https://github.com/cstamas/tika-1292
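
To see why an "uncompressed" ZIP can look like HTML to a byte-level magic matcher, the sketch below builds a ZIP with a single STORED entry using only `java.util.zip` and then searches the raw archive bytes for the HTML marker. It uses no Tika code. The class name `StoredZipSketch`, the helper names, and the entry name `license.html` are made up for the illustration; the real artifact in the report contains more entries.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Illustrative only: shows that STORED entries keep their text readable. */
final class StoredZipSketch {

    /** Builds an in-memory ZIP holding one entry written without compression. */
    static byte[] storedZip(String entryName, String text) {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(data);

        // STORED entries need size, compressed size and CRC set up front.
        ZipEntry entry = new ZipEntry(entryName);
        entry.setMethod(ZipEntry.STORED);
        entry.setSize(data.length);
        entry.setCompressedSize(data.length);
        entry.setCrc(crc.getValue());

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(entry);
            zip.write(data);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Returns the first offset of needle in haystack, or -1 if absent. */
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] zip = storedZip("license.html", "<html><body>License</body></html>");
        boolean zipMagic = zip[0] == 'P' && zip[1] == 'K' && zip[2] == 3 && zip[3] == 4;
        int htmlAt = indexOf(zip, "<html".getBytes(StandardCharsets.US_ASCII));
        System.out.println("starts with ZIP magic: " + zipMagic);
        System.out.println("offset of <html marker: " + htmlAt);
    }
}
```

The archive starts with the ZIP signature, yet the HTML marker also appears verbatim shortly after the local file header. A detector that scans a window near the start of the file for HTML markers would therefore see two valid matches, and with equal priorities nothing forces it to prefer the container type.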

This message was sent by Atlassian JIRA
