tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-289) Add magic byte patterns from file(1)
Date Sun, 01 Mar 2015 06:56:05 GMT

    [ https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341993#comment-14341993 ]

Nick Burch commented on TIKA-289:

There are a few issues with integrating the file(1) magic:
 * Very few of the entries in the file magic list have mimetypes, only descriptions, so we'd
need to manually review each one and search for a mimetype. (I see only 287 different mimetypes,
compared to the vast number of magic entries)
 * Many of the file magic entries include a little bit of parser logic too, with various bits
of the matching included in the description string, sometimes a lot of them
 * Some of the matching is actually done in code (much like our container-aware detectors),
not in the mime magic; see the {{src}} directory for those
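
To make the first point concrete, here is a rough sketch of how one might count how many top-level magic entries carry a mimetype. It assumes the magic(5) text format, where continuation tests start with {{>}}, comments start with {{#}}, and a mimetype is attached with a {{!:mime}} line. The function name {{summarize_magic}} and the sample entries are made up for illustration and are not copied from the file(1) sources.

```python
def summarize_magic(text):
    """Summarize magic(5)-style text.

    Returns (top_level_entries, entries_with_mime, distinct_mimetypes).
    Assumption: a top-level entry is any non-blank line that is not a
    comment (#), an annotation (!:) or a continuation test (>).
    """
    top_level = 0
    with_mime = 0
    mimetypes = set()
    in_entry = False          # have we seen a top-level entry yet?
    entry_has_mime = False    # has the current entry already been counted?

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("!:"):
            parts = line.split(None, 1)
            if parts[0] == "!:mime" and len(parts) == 2:
                mimetypes.add(parts[1].strip())
                if in_entry and not entry_has_mime:
                    with_mime += 1
                    entry_has_mime = True
            continue
        if line.startswith(">"):
            # Continuation test: belongs to the current entry.
            continue
        # Anything else starts a new top-level entry.
        top_level += 1
        in_entry = True
        entry_has_mime = False

    return top_level, with_mime, mimetypes


# Illustrative sample only; real magic files are far larger.
SAMPLE = "\n".join([
    r"# PDF",
    r"0 string %PDF- PDF document",
    r"!:mime application/pdf",
    r">5 byte x \b, version %c",
    r"0 string PK\003\004 Zip archive data",
    r"0 belong 0xcafebabe compiled Java class data",
    r"!:mime application/x-java-applet",
])

if __name__ == "__main__":
    total, tagged, types = summarize_magic(SAMPLE)
    print(f"{tagged} of {total} top-level entries have a mimetype")
    print(sorted(types))
```

Running something like this over the magic files would show which entries still need a manual mimetype lookup; it does nothing about the parser logic embedded in the description strings.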

The file magic and source code are a very good source of magic patterns, and sometimes also
basic parser logic, but I'm not sure how practical a bulk import would be.

> Add magic byte patterns from file(1)
> ------------------------------------
>                 Key: TIKA-289
>                 URL: https://issues.apache.org/jira/browse/TIKA-289
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>            Reporter: Jukka Zitting
>            Priority: Minor
> As discussed in TIKA-285, the file(1) command comes with a pretty comprehensive set of
> magic byte patterns. It would be nice to get those patterns included also in Tika.

This message was sent by Atlassian JIRA
