tika-dev mailing list archives

From "Peter May (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-847) Add regular expression support to the MagicDetector
Date Thu, 26 Jan 2012 13:42:54 GMT

     [ https://issues.apache.org/jira/browse/TIKA-847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter May updated TIKA-847:

    Attachment: regex_support.patch

Patch that updates MagicDetector and its associated unit tests to support regular expressions
in the signature file (EOF regular expressions are not supported).

This required a slight extension to the freedesktop.org mime-info format: the "match" element
now accepts a type="regex" attribute.  Is there an XML schema for mime-info anywhere?  If so,
it would also need updating.
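
To illustrate the idea, here is a minimal sketch of how a regular expression could be applied
to the leading bytes of a stream. It is not the code in regex_support.patch: the class name
RegexMagicSketch, the method matches and its offset/length parameters are hypothetical. The
sketch assumes the bytes are decoded as ISO-8859-1, so that every byte maps to exactly one
char and byte offsets line up with char offsets.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not taken from the patch.
final class RegexMagicSketch {

    /**
     * Returns true if the pattern is found anywhere inside the window of
     * `length` bytes that starts at `offset` within the given prefix.
     * The window is clamped to the bytes that are actually available.
     */
    static boolean matches(byte[] prefix, Pattern pattern, int offset, int length) {
        int start = Math.min(Math.max(offset, 0), prefix.length);
        int len = Math.min(Math.max(length, 0), prefix.length - start);
        // Assumption: ISO-8859-1 decoding, one char per byte, so the regex
        // sees the raw byte values 0x00-0xFF unchanged.
        String window = new String(prefix, start, len, StandardCharsets.ISO_8859_1);
        return pattern.matcher(window).find();
    }
}
```

For example, a pattern such as `%PDF-1\.[0-7]` applied at offset 0 would match the first bytes
of a PDF file, but would not match if the window started at offset 1.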

I also noticed what I consider a minor bug in the while loop at line 315 of MagicDetector
(https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/detect/MagicDetector.java#L315):
the offset is not incremented by the number of bytes read.  I have corrected that in this
patch, but I can split it out as a separate issue if preferred.
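
For reference, the general shape of a read loop that does advance the offset is sketched
below. This is not the MagicDetector source or the patched code; the class ReadLoopSketch and
the method readUpTo are hypothetical names. The point it shows is that InputStream.read may
return fewer bytes than requested, so each call has to continue from where the previous one
stopped.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not taken from MagicDetector.
final class ReadLoopSketch {

    /**
     * Fills the buffer from the stream until it is full or the stream ends.
     * Returns the number of bytes actually read.
     */
    static int readUpTo(InputStream in, byte[] buffer) throws IOException {
        int offset = 0;
        while (offset < buffer.length) {
            int n = in.read(buffer, offset, buffer.length - offset);
            if (n == -1) {
                break; // end of stream reached before the buffer was full
            }
            // Advance by the number of bytes read; without this step each
            // short read would overwrite the bytes from the previous one.
            offset += n;
        }
        return offset;
    }
}
```

If the offset is left unchanged, the loop only behaves correctly when the first read happens
to fill the whole buffer.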

> Add regular expression support to the MagicDetector
> ---------------------------------------------------
>                 Key: TIKA-847
>                 URL: https://issues.apache.org/jira/browse/TIKA-847
>             Project: Tika
>          Issue Type: New Feature
>          Components: mime
>    Affects Versions: 1.0
>            Reporter: Andrew Jackson
>              Labels: detection, format
>         Attachments: regex_support.patch
>
> Following on from TIKA-86, we would like to add support for regular expressions to the
> MagicDetector. This would allow more signatures to be re-used from more sources (e.g. the
> file(1) command). As part of the SCAPE Project, we have added this functionality to our own
> Tika branch (e.g. https://github.com/openplanets/tika/commit/b8de9e77c8b432788e3f73a4dbccca8ea044b503)
> and are working to tidy this up to make a clean patch we can submit here.
>
> BTW, are there any patch submission guidelines or coding standards we should check our
> work against first?
