tika-dev mailing list archives

From "Andrew Jackson (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-86) Support magic(5) files
Date Mon, 16 Jan 2012 16:52:41 GMT

    [ https://issues.apache.org/jira/browse/TIKA-86?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187018#comment-13187018 ]

Andrew Jackson commented on TIKA-86:
------------------------------------

We've done some work in this area and noticed that other identification tools (including
file(1)) use a wider range of matching methods than Tika currently supports, e.g. regular
expressions. To this end, we've extended Tika to support RegEx magic (see e.g. this commit
on our GitHub repo: https://github.com/openplanets/tika/commit/b8de9e77c8b432788e3f73a4dbccca8ea044b503).
We'd be happy to tidy this code up and submit it here if re-using RegEx magic from other
tools is of interest to the core Tika project.
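
As a rough illustration of what regular-expression magic means in practice, here is a minimal Java sketch. It is not the code from the commit referenced above and it does not use Tika's API: the class name RegexMagicSketch, the helper matchesAtStart, and the PDF pattern are all hypothetical, chosen only to show a regex being applied to the leading bytes of a file.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class RegexMagicSketch {

    // Hypothetical helper (not part of Tika): returns true if the regex
    // matches at offset 0 of the given header bytes.
    static boolean matchesAtStart(byte[] header, String regex) {
        // ISO-8859-1 maps every byte to exactly one char, so byte offsets
        // and char offsets stay aligned even for binary data.
        String text = new String(header, StandardCharsets.ISO_8859_1);
        // lookingAt() anchors the match at the start of the input but,
        // unlike matches(), does not require the whole input to match.
        return Pattern.compile(regex).matcher(text).lookingAt();
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] txt = "plain text".getBytes(StandardCharsets.ISO_8859_1);

        // Assumed example pattern: the "%PDF-" marker followed by a version.
        String pdfMagic = "%PDF-\\d\\.\\d";

        System.out.println(matchesAtStart(pdf, pdfMagic)); // true
        System.out.println(matchesAtStart(txt, pdfMagic)); // false
    }
}
```

A fixed-string magic rule could only test for the literal "%PDF-" prefix; the regex form can also constrain what follows it, which is the kind of wider matching the other tools rely on.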

However, to get back to the point, I agree that simply having a parser for file magic would
not work, as porting the magic is necessarily a manual process. Even when there is a MIME type,
you can't reliably tell which parts of the magic identify the format and which parts
do 'set-up' or extract properties. This implies that this feature request should
be turned down.

                
> Support magic(5) files
> ----------------------
>
>                 Key: TIKA-86
>                 URL: https://issues.apache.org/jira/browse/TIKA-86
>             Project: Tika
>          Issue Type: New Feature
>          Components: general
>            Reporter: Jukka Zitting
>
> Tika should have a parser for the magic(5) file format used by the file(1) command. Then
> we could use existing magic rules from places like http://svn.apache.org/repos/asf/httpd/httpd/trunk/docs/conf/magic.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
