tika-dev mailing list archives

From "Jean-Philippe Ricard (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-391) Intermittent errors detecting xls files
Date Mon, 14 Jun 2010 13:53:14 GMT

    [ https://issues.apache.org/jira/browse/TIKA-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878579#action_12878579 ]

Jean-Philippe Ricard commented on TIKA-391:
-------------------------------------------

I had a similar problem when using the detector with a Microsoft Word document (.doc). Running
the detector multiple times on the same file returns different results: sometimes application/msword
and sometimes application/x-tika-msoffice.

The problem lies in the compareTo() method of the Magic class. Because compareTo() is broken,
the order of Magic instances in the SortedSet held by the MimeTypes class is also broken,
which leads to inconsistent mime type detection.

When the priority and the size are the same, the comparison falls back to the toString() of a
Magic, which returns something like "50/org.apache.tika.detect.MagicDetector@732b3d53".
Since MagicDetector does not override toString() in this case, the string ends with the default
Object.toString() identity hash code, which is not stable from one run to the next, so the
comparison of Magic instances gives effectively random results. Using the type (getType()) of a
Magic in compareTo() instead of toString() would ensure more consistent results.
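
To illustrate the suggestion, here is a minimal sketch of a compareTo() that breaks ties on the
type string. MagicSketch, its fields, and the orderedTypes() helper are hypothetical stand-ins,
not the actual Tika source, and the sort direction for priority and size is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for Tika's Magic class (names and fields are assumptions).
class MagicSketch implements Comparable<MagicSketch> {
    private final String type;   // mime type name, e.g. "application/msword"
    private final int priority;  // assumed: higher priority sorts first
    private final int size;      // assumed: longer magic pattern sorts first

    MagicSketch(String type, int priority, int size) {
        this.type = type;
        this.priority = priority;
        this.size = size;
    }

    String getType() {
        return type;
    }

    @Override
    public int compareTo(MagicSketch other) {
        int diff = Integer.compare(other.priority, priority);
        if (diff == 0) {
            diff = Integer.compare(other.size, size);
        }
        if (diff == 0) {
            // Tie-breaker on the type name: the same on every run, unlike a
            // default toString() that embeds an identity hash code.
            diff = type.compareTo(other.type);
        }
        return diff;
    }

    // Helper for the demo: put the rules in a SortedSet and list the types in order.
    static List<String> orderedTypes(MagicSketch... rules) {
        SortedSet<MagicSketch> sorted = new TreeSet<>();
        for (MagicSketch rule : rules) {
            sorted.add(rule);
        }
        List<String> types = new ArrayList<>();
        for (MagicSketch rule : sorted) {
            types.add(rule.getType());
        }
        return types;
    }

    public static void main(String[] args) {
        System.out.println(orderedTypes(
                new MagicSketch("application/x-tika-msoffice", 50, 8),
                new MagicSketch("application/msword", 50, 8)));
    }
}
```

With this ordering, two entries with equal priority and size always come out in the same order,
whatever order they were inserted in. One caveat of the sketch: a TreeSet treats entries that
compare as equal as duplicates, so two entries with the same type, priority, and size would
collapse into one unless a further tie-breaker is added.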

Jean-Philippe

> Intermittent errors detecting xls files
> ---------------------------------------
>
>                 Key: TIKA-391
>                 URL: https://issues.apache.org/jira/browse/TIKA-391
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 0.6
>            Reporter: Simon Tyler
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: MimeTypes.java
>
>
> I am doing some testing of Tika 0.6 and noticed some odd results for the testEXCEL.xls
> file included in the test suite.
> 100 calls to the following code:
>  
>             is = new BufferedInputStream(new FileInputStream(filename));
>  
>             Metadata metadata = new Metadata();
>             metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
>  
>             String type = tika.detect(is, metadata);
>  
> Results in different matches, either application/msword or application/vnd.ms-excel, seemingly
> at random.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

