tika-dev mailing list archives

From "Nick Burch (Jira)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-3023) Text files starting with MOVI are detected as X-SGI-Movie
Date Thu, 09 Jan 2020 10:02:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011634#comment-17011634 ]

Nick Burch commented on TIKA-3023:

Assuming that the byte after MOVI is part of a version or length field, perhaps it's safest to
check for 0x00 or 0x01 or 0x02? That shouldn't trigger on text, but would probably still allow
detection of the real thing. (Some sort of "not a printable ASCII character" check is probably a
step too far and probably wouldn't help much either!)
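
A rough sketch of that check in Python, only to illustrate the idea. This is not Tika's
detection code (Tika's magic rules live in its MIME type definitions), and the function and
constant names are made up for the example. It also assumes, as above, that the fifth byte of
a real file is 0x00, 0x01 or 0x02, which has not been verified against real SGI Movie files.

```python
# Hypothetical illustration of the proposed rule, not Tika's implementation.
MOVI_MAGIC = b"MOVI"  # bytes 4D 4F 56 49

# Assumption from the comment above: the byte after the magic belongs to a
# version or length field and is one of these small values.
ALLOWED_NEXT_BYTES = {0x00, 0x01, 0x02}


def looks_like_sgi_movie(head: bytes) -> bool:
    """Return True if the leading bytes match MOVI plus an allowed fifth byte."""
    if len(head) < len(MOVI_MAGIC) + 1:
        return False  # too short to contain the magic and the extra byte
    if not head.startswith(MOVI_MAGIC):
        return False
    # Indexing a bytes object yields an int, so this compares byte values.
    return head[len(MOVI_MAGIC)] in ALLOWED_NEXT_BYTES
```

With this rule, text such as b"MOVIES I like" is rejected because its fifth byte is the
letter "E" (0x45), while b"MOVI\x00..." would still match.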

> Text files starting with MOVI are detected as X-SGI-Movie
> ---------------------------------------------------------
>                 Key: TIKA-3023
>                 URL: https://issues.apache.org/jira/browse/TIKA-3023
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.23
>         Environment: Issue recreated on
> Windows 10 Professional 64bit running the runnable Jar
> Ubuntu 16.04.6 LTS running Tika-Python
>            Reporter: Steve
>            Priority: Minor
>         Attachments: capitalmovie.txt
> If a plaintext file starts with "MOVI" Tika labels it as an SGI Movie.
> The hex conversion for MOVI is 4D 4F 56 49 which is the same as the header for the SGI Movie file format
> [https://reposcope.com/mimetype/video/x-sgi-movie]
> This SGI format isn't supported so any information from a text file starting like this would be lost. I've attached a simple file that should recreate the problem.
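
The byte values quoted in the description can be checked with a few lines of Python. This
only confirms the hex encoding of the four ASCII letters and shows why a four-byte magic
cannot tell such a text file apart; the sample text is invented and is not the attached
capitalmovie.txt.

```python
# Hex encoding of the ASCII string "MOVI", as quoted in the issue description.
movi_hex = "MOVI".encode("ascii").hex(" ").upper()

# Invented sample text that happens to start with the same four bytes.
sample_text = b"MOVIE night is on Friday"
shares_magic = sample_text.startswith(bytes.fromhex("4D 4F 56 49"))
```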

This message was sent by Atlassian Jira
