tika-dev mailing list archives

From "Peter May (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-847) Add regular expression support to the MagicDetector
Date Thu, 26 Jan 2012 13:42:54 GMT

     [ https://issues.apache.org/jira/browse/TIKA-847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter May updated TIKA-847:
---------------------------

    Attachment: regex_support.patch

This patch updates MagicDetector and its associated unit tests to support regular expressions
in the signature file (EOF regular expressions are not supported).

This required a slight extension to the freedesktop.org mime-info format: a type="regex" attribute
on the "match" element.  Is there an XML schema for mime-info anywhere? If so, it would also
need updating.
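
For illustration only, here is a minimal sketch of how a type="regex" match could be evaluated against the bytes read from the start of a stream. It is not the code in the attached patch. The class name RegexMagicSketch, the method matches, and the offset-range parameters are hypothetical. The sketch assumes the bytes are decoded as ISO-8859-1 so that each byte maps to exactly one char, and that offsets are non-negative.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the patch itself: applies a regular expression
// to the leading bytes of a document, trying each permitted start offset.
class RegexMagicSketch {

    // Returns true if the pattern matches starting at some offset in
    // [offsetStart, offsetEnd]. Assumes offsetStart >= 0.
    static boolean matches(byte[] prefix, Pattern pattern,
                           int offsetStart, int offsetEnd) {
        // ISO-8859-1 maps bytes 0x00-0xFF one-to-one onto chars, so
        // character indices equal byte offsets.
        String text = new String(prefix, StandardCharsets.ISO_8859_1);
        Matcher matcher = pattern.matcher(text);
        for (int i = offsetStart; i <= offsetEnd && i <= text.length(); i++) {
            matcher.region(i, text.length());
            if (matcher.lookingAt()) {   // match must begin exactly at i
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.ISO_8859_1);
        Pattern signature = Pattern.compile("%PDF-\\d\\.\\d");
        System.out.println(matches(pdf, signature, 0, 0));
    }
}
```

Decoding as ISO-8859-1 keeps the sketch short; a real detector might instead wrap the byte buffer in a CharSequence to avoid the copy.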

I also noticed what I consider a minor bug in the while loop at line 315 (https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/detect/MagicDetector.java#L315)
of MagicDetector, where the offset is not incremented by the number of bytes read.  I have
corrected that in this patch, but I can split it out into a separate issue if preferred.
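
To show the kind of loop involved, here is a minimal sketch of reading a fixed-size prefix with the offset advanced by the number of bytes read. It is not the MagicDetector source. The names PrefixReaderSketch, readPrefix, and TrickleStream are hypothetical, and TrickleStream exists only to simulate a stream that returns short reads.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch, not the MagicDetector source.
class PrefixReaderSketch {

    // Fills the buffer from the stream and returns the number of bytes read.
    static int readPrefix(InputStream in, byte[] buffer) throws IOException {
        int offset = 0;
        while (offset < buffer.length) {
            int n = in.read(buffer, offset, buffer.length - offset);
            if (n == -1) {
                break;           // end of stream before the buffer was full
            }
            // Without this increment, every short read would overwrite the
            // start of the buffer instead of appending to it.
            offset += n;
        }
        return offset;
    }

    // Test helper: returns at most maxChunk bytes per read call.
    static class TrickleStream extends FilterInputStream {
        private final int maxChunk;

        TrickleStream(InputStream in, int maxChunk) {
            super(in);
            this.maxChunk = maxChunk;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, maxChunk));
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        byte[] buffer = new byte[8];
        InputStream in = new TrickleStream(new ByteArrayInputStream(data), 3);
        System.out.println(readPrefix(in, buffer));
    }
}
```

The problem only shows up when read returns fewer bytes than requested, which is common with network or compressed streams but rare with in-memory test data.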
                
> Add regular expression support to the MagicDetector
> ---------------------------------------------------
>
>                 Key: TIKA-847
>                 URL: https://issues.apache.org/jira/browse/TIKA-847
>             Project: Tika
>          Issue Type: New Feature
>          Components: mime
>    Affects Versions: 1.0
>            Reporter: Andrew Jackson
>              Labels: detection, format
>         Attachments: regex_support.patch
>
>
> Following on from TIKA-86, we would like to add support for regular expressions to the
> MagicDetector. This would allow more signatures to be re-used from more sources (e.g. the
> file(1) command). As part of the SCAPE Project, we have added this functionality to our own
> Tika branch (e.g. https://github.com/openplanets/tika/commit/b8de9e77c8b432788e3f73a4dbccca8ea044b503)
> and are working to tidy this up to make a clean patch we can submit here.
> BTW, are there any patch submission guidelines or coding standards we should check our
> work against first?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
