tika-dev mailing list archives

From "Nick Burch (Jira)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2935) MP4 content type identified as application/mp4 rather than video/mp4
Date Tue, 17 Sep 2019 13:38:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931467#comment-16931467 ]

Nick Burch commented on TIKA-2935:

MP4 is a container format, and container formats really do need a dedicated container-aware
detector to do detection correctly. Tika provides such detectors (in the parsers package) for
formats like ZIP and OLE2. For media formats, we have one for Ogg (for complex historical
reasons, it lives in a third-party repo that's pulled in).

We don't currently have anything like that for MP4. Instead, all we have is a check of the
type of the first atom, via mime magic. If that's a well-known one, we return the specific
type; otherwise we return the general one.
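
To make the first-atom check concrete, here is a minimal standalone sketch in plain Java. It is not Tika's code: the class name, the helper methods and the small brand table are all made up for illustration, and Tika's real mime magic definitions are not reproduced here. It reads the box type at offset 4 and the major brand at offset 8, and falls back to the general type when the brand isn't in the table.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical sketch of a first-atom check; not part of Tika.
class FirstAtomSketch {
    // Illustrative brand table only (an assumption, NOT Tika's magic definitions).
    static final Map<String, String> KNOWN_BRANDS = Map.of(
            "mp41", "video/mp4",
            "mp42", "video/mp4",
            "M4A ", "audio/mp4");

    // Layout of the first box: size (4 bytes), type (4 bytes), major brand (4 bytes).
    static String guessType(byte[] head) {
        if (head.length < 12) {
            return "application/octet-stream";
        }
        String boxType = new String(head, 4, 4, StandardCharsets.US_ASCII);
        if (!boxType.equals("ftyp")) {
            return "application/octet-stream";
        }
        String brand = new String(head, 8, 4, StandardCharsets.US_ASCII);
        // Well-known brand: specific type. Anything else: the general type.
        return KNOWN_BRANDS.getOrDefault(brand, "application/mp4");
    }

    // Builds a minimal 16-byte ftyp box so the sketch can run without a real file.
    static byte[] ftypHeader(String brand) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putInt(16);                                        // box size, big-endian
        buf.put("ftyp".getBytes(StandardCharsets.US_ASCII));   // box type
        buf.put(brand.getBytes(StandardCharsets.US_ASCII));    // major brand (4 chars)
        buf.putInt(0);                                         // minor version
        return buf.array();
    }

    public static void main(String[] args) {
        System.out.println(guessType(ftypHeader("mp42")));
        System.out.println(guessType(ftypHeader("zzzz")));
    }
}
```

A check like this never looks past the first box, so a file with real video and audio tracks but a brand outside the table still comes back as the general type.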

If this is something you'd like to work on, great! Please take a look at [https://github.com/Gagravarr/VorbisJava/blob/master/tika/src/main/java/org/gagravarr/tika/OggDetector.java]
for an example of how to do media container-based detection.

Tika has a bit of MP4 detection logic in the parser class, see MP4Parser. You'd probably need
to refactor that out into a new Detector, to be re-used by the parser, then add additional logic
to check the streams for cases where the FileTypeBox atom is missing or inconclusive.
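
As a rough idea of what the stream check could decide once the tracks are known, here is a small sketch that applies the RFC 4337 guidelines quoted in the issue. It assumes some container-walking code (not shown, and not existing in Tika today) has already collected the handler type of each track; "vide" and "soun" are the handler types the ISO base media file format uses for video and sound tracks. The class and method names are hypothetical.

```java
import java.util.List;

// Hypothetical sketch of the RFC 4337 selection rules; not part of Tika.
class Rfc4337TypeSketch {
    // handlerTypes: one four-character handler type per track, assumed to have
    // been extracted already by a container-aware walk of the file.
    static String selectType(List<String> handlerTypes) {
        boolean hasVideo = handlerTypes.contains("vide");
        boolean hasAudio = handlerTypes.contains("soun");
        if (hasVideo) {
            return "video/mp4";       // rule b: visual content present
        }
        if (hasAudio) {
            return "audio/mp4";       // rule c: audio but no visual aspect
        }
        return "application/mp4";     // rule a: neither visual nor audio
    }

    public static void main(String[] args) {
        // A file with both a video and a sound track, like the one reported.
        System.out.println(selectType(List.of("vide", "soun")));
    }
}
```

Rule c says audio/mp4 "may" be used, so returning video/mp4 for audio-only files would also be within the RFC; the sketch picks the more specific type.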

> MP4 content type identified as application/mp4 rather than video/mp4
> --------------------------------------------------------------------
>                 Key: TIKA-2935
>                 URL: https://issues.apache.org/jira/browse/TIKA-2935
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, mime
>    Affects Versions: 1.21
>            Reporter: Steven Baskin
>            Priority: Minor
>         Attachments: Smile_Doctors_-_Team_Members.mp4
> I'm currently trying to use the Tika detector to identify the content type of the file attached
> to this ticket. Using both the default detector with the default config in Java and the
> {{tika-app-1.21}} jar, the type is being returned as {{application/mp4}}.
> According to [https://tools.ietf.org/html/rfc4337]:
> {quote}Selection of MIME Types for MP4 Files
> The MIME types to be assigned to MP4 files are selected according to
>  the contents. Basic guidelines for selecting MIME types are as
>  follows:
> a) if the file contains neither visual nor audio presentations, but
>  only, for example, MPEG-J or MPEG-7, use application/mp4;
> b) for all other files, including those that have MPEG-J, etc., in
>  addition to video or audio streams, video/mp4 should be used;
>  however:
> c) for files with audio but no visual aspect, including those that
>  have MPEG-J, etc., in addition to audio streams, audio/mp4 may be
>  used.
> {quote}
> As the file has both video and audio components, it seems this file should be identified
> as {{video/mp4}}.
> I was hoping to get some help in working out whether this is a bug that can be fixed or
> whether there is a problem on my end.

This message was sent by Atlassian Jira
