tika-dev mailing list archives

From Simon Tyler <sty...@mimecast.net>
Subject Re: Detector results for Excel formats
Date Wed, 24 Mar 2010 09:32:03 GMT

Raised https://issues.apache.org/jira/browse/TIKA-391 and provided a Tika
0.6-based fix. A full fix might be more involved, as the issue can apply to
any method that uses the results from getMimeType.

Simon
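
To make the idea concrete, here is a minimal sketch of the hint-based
selection described in the quoted message below: collect every type whose
magic matched, then return the first one that agrees with either the filename
or the declared content type. This is not the patch attached to TIKA-391 and
not Tika's API. The names HintedDetection, Candidate and pick, the extension
lists, and both fallbacks are assumptions made only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: none of these names are Tika classes or methods.
final class HintedDetection {

    /** A type whose magic matched, plus the filename extensions it claims. */
    static final class Candidate {
        final String mimeType;
        final List<String> extensions;

        Candidate(String mimeType, String... extensions) {
            this.mimeType = mimeType;
            this.extensions = Arrays.asList(extensions);
        }
    }

    /**
     * Returns the first magic match that agrees with either hint.
     * Falls back to the first magic match when no hint agrees, and to
     * application/octet-stream when nothing matched (both are assumptions).
     */
    static String pick(List<Candidate> magicMatches, String fileName, String declaredType) {
        if (magicMatches.isEmpty()) {
            return "application/octet-stream";
        }
        String lowerName = fileName == null ? "" : fileName.toLowerCase(Locale.ROOT);
        for (Candidate candidate : magicMatches) {
            // equalsIgnoreCase(null) is false, so a missing hint never matches.
            if (candidate.mimeType.equalsIgnoreCase(declaredType)) {
                return candidate.mimeType;
            }
            for (String extension : candidate.extensions) {
                if (lowerName.endsWith(extension)) {
                    return candidate.mimeType;
                }
            }
        }
        return magicMatches.get(0).mimeType;
    }

    public static void main(String[] args) {
        // Both types share the same magic, so both are candidates.
        List<Candidate> matches = Arrays.asList(
                new Candidate("application/msword", ".doc"),
                new Candidate("application/vnd.ms-excel", ".xls"));
        // The filename hint breaks the tie; prints application/vnd.ms-excel
        System.out.println(pick(matches, "testEXCEL.xls", null));
    }
}
```

With this shape the result no longer depends on which of the two tied magic
entries happens to come first, as long as at least one hint is supplied.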

On 23/03/2010 13:13, "Mattmann, Chris A (388J)"
<chris.a.mattmann@jpl.nasa.gov> wrote:

> Hi Simon,
> 
> Can you prepare a patch, and post it to JIRA? I'll happily take a look.
> 
> Thanks,
> Chris
> 
> 
> On 3/23/10 3:43 AM, "Simon Tyler" <styler@mimecast.net> wrote:
> 
> 
> 
> I have had a further look at the nature of the failure to detect the type of
> the particular file and still feel it is a bug.
> 
> This is an Excel (.xls) spreadsheet and I give the detector the correct
> filename and the correct content type for it. The detector still sometimes
> fails to identify it correctly.
> 
> I had a look at the code and the reason is now clear to me and is easily
> fixed.
> 
> The getMimeType method searches for a magic match and stops at the first
> hit. The search is ordered (based on priority, size and clause). This
> particular file matches two detectors (Word and Excel) which compare
> identically. This means their order in the SortedSet is undefined, which
> is the cause of the problem.
> 
> A fix is for getMimeType to return the complete set of matches rather than a
> single match, and then to apply the filename and content-type hints to each
> match, returning the first that matches either. I have modified the code to
> do this and it solves the problem. The hint matching could be improved
> further if necessary so that it picks the best match from the set based on
> both hints rather than just stopping at the first.
> 
> Simon
> 
> 
> On 18/03/2010 19:16, "Alex Ott" <alexott@gmail.com> wrote:
> 
>> Re
>> 
>> Ken Krugler  at "Thu, 18 Mar 2010 12:07:14 -0700" wrote:
>>  KK> Thanks, Alex - great input.
>> 
>>  KK> We'd run into similar problems at Krugle, with determining the correct
>>  KK> mime-type for source code. Sometimes you wind up needing to parse the
>>  KK> code to make the correct choice.
>> 
>>  KK> We had extended the Nutch mime-type detector to support both regex and
>>  KK> post-processing to handle this disambiguation.
>> 
>>  KK> But that was hard-coded for a handful of known edge cases.
>> 
>>  KK> One possible way for this to work with the current XML-based mime-type
>>  KK> definitions is to have a "here's the name of the class you'll have to
>>  KK> instantiate and run to make the final call"
>> 
>> Yes - I have something like this in my own media type detector (for data
>> leak prevention): when a signature (either CFBF or Zip) is found, the
>> corresponding code is called and returns a constant that corresponds to some
>> type (I need to implement the logic inside my own code, because sometimes the
>> rules are too complex to express as simpler rules). In the end I have
>> something like:
>> 
>> if CFBF Signature then get type from CFBF and if type == NNN then mimetype =
>> word/excel/...
>> 
>> But I have a special Lisp-like language to describe complex checks...
>> 
>>  KK> -- Ken
>> 
>>  KK> On Mar 18, 2010, at 11:21am, Alex Ott wrote:
>> 
>>>> 
>>>> I'm not sure whether this is relevant for Tika, but I looked into its mime
>>>> database and see a problem in the definitions: both types use the common OLE
>>>> (MS CFBF - Microsoft Compound File Binary Format) signature, which is also
>>>> used by dozens of file formats.  To detect the mime type of CFBF files
>>>> correctly, you need to analyze the file (with POI?) and detect which objects
>>>> are located at the top level (directly under the Root Directory entry) of
>>>> the OLE file.  For Word this is the WordDocument object, while for Excel it
>>>> is Workbook or Book.  A simple search for the corresponding names will not
>>>> help, because all these objects could be embedded into other documents via
>>>> OLE.
>>>> 
>>>> You can find other details in the official Microsoft documentation.
>>>> 
>>>> Simon Tyler  at "Thu, 18 Mar 2010 18:12:16 +0000" wrote:
>>>> ST> Hi,
>>>> 
>>>> ST> I haven't seen any responses to this. Does anyone know why I should be
>>>> ST> seeing such unpredictable behaviour?
>>>> 
>>>> ST> Simon
>>>> 
>>>> ST> On 15/03/2010 09:27, "Simon Tyler" <styler@mimecast.net> wrote:
>>>> 
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I am doing some testing of Tika 0.6 and noticed some odd results for the
>>>>>> testEXCEL.xls file included in the test suite.
>>>>>> 
>>>>>> 100 calls to the following code:
>>>>>> 
>>>>>>     is = new BufferedInputStream(new FileInputStream(filename));
>>>>>> 
>>>>>>     Metadata metadata = new Metadata();
>>>>>>     metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
>>>>>> 
>>>>>>     String type = tika.detect(is, metadata);
>>>>>> 
>>>>>> Results in different matches as application/msword or
>>>>>> application/vnd.ms-excel seemingly at random.
>>>>>> 
>>>>>> Is this expected? Is there a way to mitigate it?
>>>>>> 
>>>>>> Simon
>>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> With best wishes, Alex Ott, MBA
>>>> http://alexott.blogspot.com/        http://alexott.net/
>>>> http://alexott-ru.blogspot.com/
>> 
>>  KK> --------------------------------------------
>>  KK> Ken Krugler
>>  KK> +1 530-210-6378
>>  KK> http://bixolabs.com
>>  KK> e l a s t i c   w e b   m i n i n g
>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> 
> 
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: Chris.Mattmann@jpl.nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
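
The container-based check Alex describes in the thread above can be sketched
roughly as follows: confirm the CFBF signature first, then decide the type
from the names of the entries directly under the root directory. Reading
those entry names is assumed to be done elsewhere (for example with a library
such as POI, as suggested in the thread), so no library calls appear here. The
class and method names and the application/octet-stream fallback are
hypothetical and are not part of Tika or POI.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical sketch: names are illustrative only, not Tika or POI API.
final class CfbfTypeSketch {

    // The 8-byte signature that starts a Compound File Binary Format file.
    private static final byte[] CFBF_SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** True when the buffer starts with the CFBF signature. */
    static boolean hasCfbfSignature(byte[] header) {
        if (header == null || header.length < CFBF_SIGNATURE.length) {
            return false;
        }
        for (int i = 0; i < CFBF_SIGNATURE.length; i++) {
            if (header[i] != CFBF_SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Maps the names of the top-level entries (those directly under the
     * root directory entry) to a media type. Only top-level names are
     * considered, because the same objects can be embedded deeper via OLE.
     */
    static String classify(Collection<String> topLevelEntries) {
        if (topLevelEntries.contains("WordDocument")) {
            return "application/msword";
        }
        if (topLevelEntries.contains("Workbook") || topLevelEntries.contains("Book")) {
            return "application/vnd.ms-excel";
        }
        // Assumed fallback for CFBF files of some other kind.
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] header = {
            (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1, 0, 0
        };
        if (hasCfbfSignature(header)) {
            // Entry names would come from parsing the container.
            System.out.println(classify(Arrays.asList("Workbook")));
        }
    }
}
```

Unlike the magic-only match, this does not need filename or content-type
hints to separate Word from Excel, at the cost of parsing the container.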



