tika-dev mailing list archives

From "Simon Tyler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-391) Intermittent errors detecting xls files
Date Wed, 24 Mar 2010 09:27:31 GMT

    [ https://issues.apache.org/jira/browse/TIKA-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849118#action_12849118 ]

Simon Tyler commented on TIKA-391:

I have had a further look at the nature of the failure to detect the type of
the particular file and still feel it is a bug.

This is an Excel (.xls) spreadsheet and I give the detector the correct
filename and the correct content type for it. The detector still sometimes
fails to identify it correctly.

I had a look at the code and the reason is now clear to me and is easily
fixed.

The getMimeType method searches for a magic match and stops at the first
hit. The search is ordered (based on priority, size and clause). This
particular file matches two detectors (Word and Excel) which compare
identically. This means their relative order in the SortedSet is undefined,
and that is the cause of the problem.
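
To illustrate the tie, here is a minimal sketch in plain Java. It is a
simplified model and not Tika's code: the Rule class, the priority value of
50 and the firstHit method are hypothetical, and both rules are assumed to
match the file. With a stable sort, the first hit is simply whichever of the
tied rules was supplied first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TieOrderDemo {
    // Hypothetical stand-in for a magic rule; not Tika's Magic class.
    static final class Rule {
        final String type;
        final int priority;

        Rule(String type, int priority) {
            this.type = type;
            this.priority = priority;
        }
    }

    // Sorts by descending priority and returns the type of the first rule.
    // Every rule is assumed to match the file, as in the reported case.
    static String firstHit(List<Rule> rules) {
        List<Rule> sorted = new ArrayList<>(rules);
        // List.sort is stable, so rules that tie keep their input order.
        sorted.sort(Comparator.comparingInt((Rule r) -> r.priority).reversed());
        return sorted.get(0).type;
    }

    public static void main(String[] args) {
        Rule word = new Rule("application/msword", 50);
        Rule excel = new Rule("application/vnd.ms-excel", 50);
        // Same rules, different input order, different first hit.
        System.out.println(firstHit(Arrays.asList(word, excel)));
        System.out.println(firstHit(Arrays.asList(excel, word)));
    }
}
```

The comparator cannot separate the two rules, so the answer is decided by
something outside the comparison.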

A fix is for getMimeType to return the complete set of matches rather than a
single match, and then to check the filename and content-type hints against
each match, returning the first match that agrees with either hint. I have
modified the code to do this and it solves the problem. The hint matching
could be improved further if necessary so that it picks the best match from
the set based on both hints rather than just stopping at the first.
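
A rough sketch of that selection step, in plain Java, might look like the
following. It is not the actual patch: the Candidate class, the choose method
and the single-extension-per-type model are hypothetical, and falling back to
the first magic match when no hint agrees is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class HintedDetectionSketch {
    // Hypothetical candidate: a media type plus one filename extension.
    static final class Candidate {
        final String type;
        final String extension;

        Candidate(String type, String extension) {
            this.type = type;
            this.extension = extension;
        }
    }

    // Returns the first magic match that agrees with either hint.
    // Assumption: with no agreeing hint, fall back to the first magic match,
    // or null when there are no matches at all.
    static String choose(List<Candidate> magicMatches, String fileName, String contentTypeHint) {
        for (Candidate c : magicMatches) {
            boolean nameAgrees = fileName != null
                    && fileName.toLowerCase(Locale.ROOT).endsWith(c.extension);
            boolean typeAgrees = c.type.equals(contentTypeHint);
            if (nameAgrees || typeAgrees) {
                return c.type;
            }
        }
        return magicMatches.isEmpty() ? null : magicMatches.get(0).type;
    }

    public static void main(String[] args) {
        Candidate word = new Candidate("application/msword", ".doc");
        Candidate excel = new Candidate("application/vnd.ms-excel", ".xls");
        // The filename hint picks Excel regardless of the order of the matches.
        System.out.println(choose(Arrays.asList(word, excel), "testEXCEL.xls", null));
        System.out.println(choose(Arrays.asList(excel, word), "testEXCEL.xls", null));
    }
}
```

With the hints applied to every match, the result no longer depends on which
of the tied matches happens to come first.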


> Intermittent errors detecting xls files
> ---------------------------------------
>                 Key: TIKA-391
>                 URL: https://issues.apache.org/jira/browse/TIKA-391
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 0.6
>            Reporter: Simon Tyler
> I am doing some testing of Tika 0.6 and noticed some odd results for the testEXCEL.xls file included in the test suite.
> 100 calls to the following code:
>             is = new BufferedInputStream(new FileInputStream(filename));
>             Metadata metadata = new Metadata();
>             metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
>             String type = tika.detect(is, metadata);
> Results in different matches as application/msword or application/vnd.ms-excel seemingly at random.

