tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Detector results for Excel formats
Date Thu, 18 Mar 2010 19:07:14 GMT
Thanks, Alex - great input.

We'd run into similar problems at Krugle, with determining the correct  
mime-type for source code. Sometimes you wind up needing to parse the  
code to make the correct choice.

We had extended the Nutch mime-type detector to support both regex and  
post-processing to handle this disambiguation.

But that was hard-coded for a handful of known edge cases.

One possible way for this to work with the current XML-based mime-type  
definitions is to have an entry that says "here's the name of the class  
you'll have to instantiate and run to make the final call."
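
As a rough sketch of that idea (not an existing Tika API): the interface
FinalCallDetector, the class HeaderFileDetector and the helper load() below
are all hypothetical names, and the class name passed to load() stands in for
whatever attribute the XML definition would carry. The C vs. C++ header check
is only an illustration of a "parse the content to decide" step.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical hook, not part of Tika: a class named in the mime-type
// definition implements this to make the final call.
interface FinalCallDetector {
    // candidateType: what glob/magic matching produced; prefix: leading bytes.
    String resolve(String candidateType, byte[] prefix);
}

// Illustrative implementation: a .h file may be C or C++.
class HeaderFileDetector implements FinalCallDetector {
    @Override
    public String resolve(String candidateType, byte[] prefix) {
        String text = new String(prefix, StandardCharsets.ISO_8859_1);
        boolean looksLikeCpp = text.contains("namespace ") || text.contains("class ");
        // Keep the candidate unless the content points clearly at C++.
        return looksLikeCpp ? "text/x-c++hdr" : candidateType;
    }
}

class FinalCallDemo {
    // className would come from the (assumed) attribute in the XML definition.
    static FinalCallDetector load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(FinalCallDetector.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load detector " + className, e);
        }
    }

    public static void main(String[] args) {
        FinalCallDetector detector = load(HeaderFileDetector.class.getName());
        byte[] prefix = "namespace foo { class Bar; }".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detector.resolve("text/x-chdr", prefix));
    }
}
```

The point of the design is that the generic detector stays data-driven and
only the ambiguous types pay for the extra class.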

-- Ken

On Mar 18, 2010, at 11:21am, Alex Ott wrote:

>
> I'm not sure whether this is relevant for Tika, but I looked into its mime
> database and see a problem in the definitions: both types use the common OLE
> signature (MS CFBF, Microsoft Compound File Binary Format), which is also
> used by dozens of other file formats.  To detect the mime type of a CFBF
> file correctly, you need to analyze it (with POI?) and determine which
> objects are located in the top-level directory (directly under the Root
> Directory entry) of the OLE file.  For Word this is the WordDocument object,
> while for Excel it is Workbook or Book.  A simple search for the
> corresponding names will not help, because all of these objects can be
> embedded into other documents via OLE.
>
> You can find other details in the official Microsoft documentation.
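
The rule described above can be sketched in plain Java. This is only an
illustration under stated assumptions: the set of top-level entry names is
assumed to have been read already by an OLE parser such as POI (that step is
not shown), OleTypeGuess and its methods are hypothetical names, and the
application/octet-stream fallback is a placeholder for "some other CFBF
format".

```java
import java.util.Arrays;
import java.util.Set;

class OleTypeGuess {
    // CFBF header signature, shared by Word, Excel and many other formats.
    static final byte[] CFBF_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True for any CFBF container, so this alone cannot separate .doc from .xls.
    static boolean hasCfbfSignature(byte[] header) {
        return header.length >= CFBF_MAGIC.length
                && Arrays.equals(Arrays.copyOf(header, CFBF_MAGIC.length), CFBF_MAGIC);
    }

    // topLevelNames: entries directly under the Root Directory entry only
    // (assumed input). Names found deeper in the tree may belong to embedded
    // objects and must not be passed in.
    static String fromTopLevelEntries(Set<String> topLevelNames) {
        if (topLevelNames.contains("WordDocument")) {
            return "application/msword";
        }
        if (topLevelNames.contains("Workbook") || topLevelNames.contains("Book")) {
            return "application/vnd.ms-excel";
        }
        return "application/octet-stream"; // placeholder fallback
    }

    public static void main(String[] args) {
        System.out.println(hasCfbfSignature(CFBF_MAGIC));
        System.out.println(fromTopLevelEntries(Set.of("Workbook", "SummaryInformation")));
    }
}
```
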
>
> Simon Tyler  at "Thu, 18 Mar 2010 18:12:16 +0000" wrote:
> ST> Hi,
>
> ST> I haven't seen any responses to this. Does anyone know why I should be
> ST> seeing such unpredictable behaviour?
>
> ST> Simon
>
> ST> On 15/03/2010 09:27, "Simon Tyler" <styler@mimecast.net> wrote:
>
>>>
>>> Hi,
>>>
>>> I am doing some testing of Tika 0.6 and noticed some odd results for the
>>> testEXCEL.xls file included in the test suite.
>>>
>>> 100 calls to the following code:
>>>
>>>            is = new BufferedInputStream(new FileInputStream(filename));
>>>
>>>            Metadata metadata = new Metadata();
>>>            metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
>>>
>>>            String type = tika.detect(is, metadata);
>>>
>>> Results in different matches as application/msword or
>>> application/vnd.ms-excel seemingly at random.
>>>
>>> Is this expected? Is there a way to mitigate it?
>>>
>>> Simon
>>>
>
> -- 
> With best wishes, Alex Ott, MBA
> http://alexott.blogspot.com/        http://alexott.net/
> http://alexott-ru.blogspot.com/

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




