tika-dev mailing list archives

From Simon Tyler <sty...@mimecast.net>
Subject Re: Detector results for Excel formats
Date Tue, 23 Mar 2010 10:43:54 GMT

I have had a further look at the nature of the failure to detect the type of
the particular file and still feel it is a bug.

This is an Excel (.xls) spreadsheet, and I give the detector the correct
filename and the correct content type for it. The detector still sometimes
fails to identify it correctly.

I had a look at the code; the reason is now clear to me and it is easily
fixed.

The getMimeType method searches for a magic match and stops at the first
hit. The search is ordered (based on priority, size and clause). This
particular file matches two detectors (Word and Excel) which compare
identically, so their relative order in the SortedSet is undefined. This is
the cause of the problem.

A fix is for getMimeType to return the complete set of matches rather than a
single match, and then to check the filename and content-type hints against
each match, returning the first match that agrees with either hint. I have
modified the code to do this and it solves the problem. The hint matching
could be improved further if necessary, so that it picks the best match from
the set based on both hints rather than just stopping at the first.
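
For illustration only, the selection step can be sketched roughly as below.
This is not the modified Tika code: HintedDetection, Candidate and pick are
hypothetical names, the extension check is a simplification of real filename
matching, and the application/octet-stream fallback is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class HintedDetection {

    /** Hypothetical stand-in for one magic match: a MIME type and an extension registered for it. */
    static final class Candidate {
        final String mimeType;
        final String extension;

        Candidate(String mimeType, String extension) {
            this.mimeType = mimeType;
            this.extension = extension;
        }
    }

    /**
     * Returns the first magic match that agrees with either hint.
     * If no match agrees, falls back to the first magic match, and to
     * application/octet-stream when there were no matches at all (assumed fallback).
     */
    static String pick(List<Candidate> magicMatches, String fileName, String contentTypeHint) {
        String lowerName = fileName == null ? "" : fileName.toLowerCase(Locale.ROOT);
        for (Candidate c : magicMatches) {
            boolean nameAgrees = lowerName.endsWith(c.extension);
            boolean typeAgrees = c.mimeType.equals(contentTypeHint);
            if (nameAgrees || typeAgrees) {
                return c.mimeType;
            }
        }
        return magicMatches.isEmpty() ? "application/octet-stream" : magicMatches.get(0).mimeType;
    }

    public static void main(String[] args) {
        // Both candidates share the same container signature, so both "match".
        List<Candidate> matches = Arrays.asList(
                new Candidate("application/msword", ".doc"),
                new Candidate("application/vnd.ms-excel", ".xls"));
        System.out.println(pick(matches, "testEXCEL.xls", null));
    }
}
```

With the hints applied this way, the order in which the two tied matches
come back no longer decides the result for a file named testEXCEL.xls.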

Simon


On 18/03/2010 19:16, "Alex Ott" <alexott@gmail.com> wrote:

> Re
> 
> Ken Krugler  at "Thu, 18 Mar 2010 12:07:14 -0700" wrote:
>  KK> Thanks, Alex - great input.
> 
>  KK> We'd run into similar problems at Krugle, with determining the correct
>  KK> mime-type for source code. Sometimes you wind up needing to parse the
>  KK> code to make the correct choice.
> 
>  KK> We had extended the Nutch mime-type detector to support both regex and
>  KK> post-processing to handle this disambiguation.
> 
>  KK> But that was hard-coded for a handful of known edge cases.
> 
>  KK> One possible way for this to work with the current XML-based mime-type
>  KK> definitions is to have a "here's the name of the class you'll have to
>  KK> instantiate and run to make the final call"
> 
> Yes - I have something like that in my own media type detector (for data
> leak prevention): when a signature (either CFBF or Zip) is found, the
> corresponding code is called, which returns a constant that corresponds to
> some type (I need to implement the logic inside my own code, because
> sometimes the rules are too complex to express as simpler rules).  At the
> end I have something like:
> 
> if CFBF Signature then get type from CFBF and if type == NNN then mimetype =
> word/excel/...
> 
> But I have a special lisp-like language to describe complex checks...
> 
>  KK> -- Ken
> 
>  KK> On Mar 18, 2010, at 11:21am, Alex Ott wrote:
> 
>>> 
>>> I'm not sure whether this is relevant for Tika, but I looked into its mime
>>> database and see a problem in the definitions: both types use the common
>>> OLE (MS CFBF, Microsoft Compound File Binary Format) signature, which is
>>> also used by dozens of other file formats.  To detect the mime type of a
>>> CFBF file correctly, you need to analyze it (with POI?) and determine
>>> which objects are located in the top-level directory (directly under the
>>> Root Directory entry) of the OLE file.  For Word this is the WordDocument
>>> object, while for Excel it is Workbook or Book.  A simple search for the
>>> corresponding names will not help, because all of these objects could be
>>> embedded into other documents via OLE.
>>> 
>>> Other details can be found in the official Microsoft documentation.
>>> 
>>> Simon Tyler  at "Thu, 18 Mar 2010 18:12:16 +0000" wrote:
>>> ST> Hi,
>>> 
>>> ST> I haven't seen any responses to this. Does anyone know why I should be
>>> ST> seeing such unpredictable behaviour?
>>> 
>>> ST> Simon
>>> 
>>> ST> On 15/03/2010 09:27, "Simon Tyler" <styler@mimecast.net> wrote:
>>> 
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I am doing some testing of Tika 0.6 and noticed some odd results for the
>>>>> testEXCEL.xls file included in the test suite.
>>>>> 
>>>>> 100 calls to the following code:
>>>>> 
>>>>>            is = new BufferedInputStream(new FileInputStream(filename));
>>>>> 
>>>>>            Metadata metadata = new Metadata();
>>>>>            metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
>>>>> 
>>>>>            String type = tika.detect(is, metadata);
>>>>> 
>>>>> Results in different matches as application/msword or
>>>>> application/vnd.ms-excel seemingly at random.
>>>>> 
>>>>> Is this expected? Is there a way to mitigate it?
>>>>> 
>>>>> Simon
>>>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> With best wishes, Alex Ott, MBA
>>> http://alexott.blogspot.com/        http://alexott.net/
>>> http://alexott-ru.blogspot.com/
> 
>  KK> --------------------------------------------
>  KK> Ken Krugler
>  KK> +1 530-210-6378
>  KK> http://bixolabs.com
>  KK> e l a s t i c   w e b   m i n i n g
> 
> 
> 
> 
> 



