tika-dev mailing list archives

From "Andreas Meier (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2527) Typos in tika-mimetypes.xml
Date Wed, 24 Jan 2018 12:45:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337531#comment-16337531 ]

Andreas Meier commented on TIKA-2527:

I attached a patch to address the mentioned problems.


Furthermore, I added three new mime-type sections for application/x-lz4, image/x-tga and audio/x-caf.

The image/x-tga section had to be placed before the application/x-123 mime-type recognition,
because the leading bytes overlap in some cases.

The important part of the image/x-tga recognition is the inner match that searches for the
trailing part:

54 52 55 45 56 49 53 49   TRUEVISI
4F 4E 2D 58 46 49 4C 45   ON-XFILE
2E 00                     ..


Is there an easier way to search for trailing magic strings than using a regex?

I thought that a simple regex on its own might be too expensive for recognizing image/x-tga, so I
combined it with the basic TGA recognition from the Linux magic file.
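
For illustration only, here is a minimal sketch of checking for the trailing part quoted above without a regex. It assumes that the 18 signature bytes are the very last bytes of the file, and it uses plain JDK calls rather than anything from Tika. The class and method names (TgaFooterCheck, endsWithTgaSignature) are hypothetical and are not part of the attached patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TgaFooterCheck {
    // "TRUEVISION-XFILE." followed by one NUL byte: 18 bytes in total,
    // matching the hex dump 54 52 55 45 ... 2E 00 quoted in the message.
    static final byte[] SIGNATURE =
            "TRUEVISION-XFILE.\0".getBytes(StandardCharsets.US_ASCII);

    // Assumption: the signature occupies the last 18 bytes of the data.
    static boolean endsWithTgaSignature(byte[] data) {
        if (data.length < SIGNATURE.length) {
            return false; // too short to contain the signature
        }
        byte[] tail = Arrays.copyOfRange(
                data, data.length - SIGNATURE.length, data.length);
        return Arrays.equals(tail, SIGNATURE);
    }

    public static void main(String[] args) {
        // Made-up sample: three arbitrary header bytes plus the signature.
        byte[] sample = new byte[3 + SIGNATURE.length];
        System.arraycopy(SIGNATURE, 0, sample, 3, SIGNATURE.length);
        System.out.println(endsWithTgaSignature(sample));
        System.out.println(endsWithTgaSignature(new byte[] {0, 0, 2}));
    }
}
```

A byte comparison like this only inspects the end of the data, so it would still need to be combined with a check of the leading bytes, as described above.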


While testing tika-mimetypes.xml, I often assumed that a match string was already
correct when the actual recognition had been done via the file extension. I therefore had
to remove the file extensions from my test files to validate the match sections.

To avoid this, I suggest either creating a test case that checks only the matches without
taking file extensions into account, or removing the file extensions from the test files so that
the matches are actually validated.

Is there a testcase that does this already?
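
As a rough sketch of the second option, the helper below copies a test file to a name without its extension, so that a name-based lookup has nothing to work with. It uses only JDK classes. The names ExtensionlessCopy, stripExtension and copyWithoutExtension are hypothetical, and the sample file in main is made up for the demonstration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class ExtensionlessCopy {
    // Drops the last extension, e.g. "sample.tga" becomes "sample".
    // Names without a dot, and dot-files such as ".hidden", stay unchanged.
    static String stripExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }

    // Copies source into targetDir under its extension-less name.
    static Path copyWithoutExtension(Path source, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(stripExtension(source.getFileName().toString()));
        return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("magic-only-test");
        // Made-up three-byte file standing in for a real test document.
        Path source = Files.write(dir.resolve("sample.tga"), new byte[] {0, 0, 2});
        Path copy = copyWithoutExtension(source, dir.resolve("no-ext"));
        System.out.println(copy.getFileName());
    }
}
```

The extension-less copy could then be passed to whatever detection call the test uses, so that only the magic matches can produce a result.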


If you have any questions or suggestions, I would be glad to hear them.

> Typos in tika-mimetypes.xml
> ---------------------------
>                 Key: TIKA-2527
>                 URL: https://issues.apache.org/jira/browse/TIKA-2527
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.0, 1.16, 1.17, 1.18
>         Environment: ALL
>            Reporter: Andreas Meier
>            Priority: Minor
>         Attachments: fix-for-TIKA2527-contributed-by-AMeier-Fixed-adpcmmi.patch
> Are these mime types in tika-mimetypes.xml
> audio/x-adbcm instead of audio/x-adpcm
> {code:xml} <mime-type type="audio/x-adbcm">{code}
> and
> audio/x-dec-adbcm instead of audio/x-dec-adpcm
> {code:xml} <mime-type type="audio/x-dec-adbcm">{code}
> intended?
> I couldn't find these mime types.
> Regards
> Andreas

This message was sent by Atlassian JIRA
