tika-dev mailing list archives

From Alex Ott <alex...@gmail.com>
Subject Re: Detector results for Excel formats
Date Thu, 18 Mar 2010 18:21:37 GMT

I'm not sure whether this still applies to Tika, but I looked into its MIME
database and see a problem in the definitions: both types use the common OLE
(MS CFBF, Microsoft Compound File Binary Format) signature, which is also
used by dozens of other file formats.  To detect the MIME type of a CFBF
file correctly, you need to parse it (with POI?) and check which objects are
located in the top-level directory (directly under the Root Directory entry)
of the OLE file.  For Word this is the WordDocument object, while for Excel
it is Workbook or Book.  Simply searching for the corresponding names will
not help, because all of these objects can be embedded in other documents
via OLE.
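
A rough sketch of that decision in Java, to make the idea concrete.  It is
not Tika's or POI's actual code: the class and method names are made up for
illustration, the fallback type is only a placeholder, and it assumes the
names of the entries directly under the Root Directory entry have already
been read by some CFBF parser (POI's POIFS would be one candidate).

```java
import java.util.Arrays;
import java.util.Set;

// Hypothetical helper, not part of Tika or POI.
final class OleTypeSketch {

    // 8-byte CFBF header signature, shared by .doc, .xls and many other formats.
    private static final byte[] CFBF_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    // True if the buffer starts with the CFBF signature.
    // This alone cannot tell Word from Excel.
    static boolean isCfbf(byte[] header) {
        return header.length >= CFBF_MAGIC.length
            && Arrays.equals(Arrays.copyOf(header, CFBF_MAGIC.length), CFBF_MAGIC);
    }

    // Picks a type from the names of the entries located directly under the
    // Root Directory entry. Names nested deeper (embedded OLE objects) must
    // not be passed in, otherwise the result is ambiguous again.
    static String classify(Set<String> topLevelNames) {
        if (topLevelNames.contains("WordDocument")) {
            return "application/msword";
        }
        if (topLevelNames.contains("Workbook") || topLevelNames.contains("Book")) {
            return "application/vnd.ms-excel";
        }
        // Placeholder for "some other CFBF-based format".
        return "application/octet-stream";
    }
}
```

The point of passing only top-level names is that an .xls file with an
embedded Word document still classifies as Excel, since the embedded
WordDocument stream sits inside a nested storage and never reaches
classify().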

You can find other details in the official Microsoft documentation.

Simon Tyler  at "Thu, 18 Mar 2010 18:12:16 +0000" wrote:
 ST> Hi,

 ST> I haven't seen any responses to this. Does anyone know why I should be
 ST> seeing such unpredictable behaviour?

 ST> Simon

 ST> On 15/03/2010 09:27, "Simon Tyler" <styler@mimecast.net> wrote:

 >> Hi,
 >> I am doing some testing of Tika 0.6 and noticed some odd results for the
 >> testEXCEL.xls file included in the test suite.
 >> 100 calls to the following code:
 >>             is = new BufferedInputStream(new FileInputStream(filename));
 >>             Metadata metadata = new Metadata();
 >>             metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
 >>             String type = tika.detect(is, metadata);
 >> Results in different matches as application/msword or
 >> application/vnd.ms-excel seemingly at random.
 >> Is this expected? Is there a way to mitigate it?
 >> Simon

With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/        http://alexott.net/
