tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1634) Detecting problem with Matlab source code
Date Thu, 04 Jun 2015 05:46:38 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572201#comment-14572201 ]

Hudson commented on TIKA-1634:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #728 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/728/])
Fix for TIKA-1634 Detecting problem with Matlab source code contributed by Jihyun Oh <mail2jhoh@gmail.com>
this closes #49. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1683464)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> Detecting problem with Matlab source code
> -----------------------------------------
>
>                 Key: TIKA-1634
>                 URL: https://issues.apache.org/jira/browse/TIKA-1634
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.8
>            Reporter: Ji-Hyun Oh
>            Assignee: Chris A. Mattmann
>            Priority: Trivial
>              Labels: earthcube
>             Fix For: 1.9
>
>         Attachments: BARCAST_MainCode.m, Initial_Vals_Maker.m, custom-mimetypes.xml,
tika-mimetypes.xml, wtsgaus.m
>
>
> Both Matlab and Objective-C source files use the same suffix, .m.
Therefore, Matlab detection relies on an additional magic match value in tika-mimetypes.xml rather than on the *.m glob.
> In tika-mimetypes.xml Matlab is defined as:
>   <mime-type type="text/x-matlab">
>     <_comment>Matlab source code</_comment>
>     <magic priority="50">
>       <match value="function [" type="string" offset="0"/>
>     </magic>
>     <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
>     <sub-class-of type="text/plain"/>
>   </mime-type>
> However, Matlab code does not always start with "function [". As a result, some Matlab
files are detected as text/x-objcsrc. Based on source code collected from NOAA Paleoclimatology
Software Resources, many Matlab files start with "function" or "%" instead, which suggests match values like these (problematic files are attached
as examples):
>   <mime-type type="text/x-matlab">
>     <_comment>Matlab source code</_comment>
>     <magic priority="50">
>       <match value="function" type="string" offset="0"/>
>       <match value="%" type="string" offset="0"/>
>     </magic>
>     <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
>     <sub-class-of type="text/plain"/>
>   </mime-type>
> I ran several detection tests on different Matlab packages obtained from NOAA
Paleoclimatology Software Resources, with and without custom-mimetypes.xml. Results are attached.
A total of 103 Matlab files were detected correctly with custom-mimetypes.xml, while
42 Matlab files were detected as Matlab without it (that is, with only the current
match value). However, these match values may be common only in the Paleoclimatology
community.
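
The difference between the current and proposed match values can be sketched in a few lines of Python. This is a simplified model of a string match at offset 0, not Tika's actual detector: it ignores magic priority and the magic rules defined for other types. The sample file openings and the names `matches_magic`, `detect_all`, `OLD_RULES`, `NEW_RULES` and `SAMPLES` are invented for illustration and are not taken from the attached files.

```python
# Simplified model of <match type="string" offset="0"/>; NOT Tika's implementation.
# All names and sample contents below are hypothetical.

OLD_RULES = [b"function ["]       # current match value in tika-mimetypes.xml
NEW_RULES = [b"function", b"%"]   # match values proposed in the issue

# Invented file openings, chosen to exercise each rule.
SAMPLES = {
    "multi_output.m": b"function [a, b] = f(x)\n",
    "single_output.m": b"function y = g(x)\n",
    "comment_first.m": b"% Initial values script\nx = 1;\n",
    "objc.m": b"#import <Foundation/Foundation.h>\n",
    "latex.tex": b"% LaTeX comment\n\\documentclass{article}\n",
}


def matches_magic(data, rules):
    """Return True if data begins with any rule value (offset 0)."""
    return any(data.startswith(value) for value in rules)


def detect_all(rules):
    """Map each sample name to whether the rule set matches it."""
    return {name: matches_magic(data, rules) for name, data in SAMPLES.items()}


if __name__ == "__main__":
    for label, rules in (("current", OLD_RULES), ("proposed", NEW_RULES)):
        hits = [name for name, hit in detect_all(rules).items() if hit]
        print(label, "rules match:", hits)
```

In this model the proposed values also match the single-output and comment-first Matlab samples that the current value misses. A bare "%" also matches the LaTeX sample, so a rule this broad may need to be weighed against other types that begin with a comment character.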



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
