tika-dev mailing list archives

From Jonathan Koren <jonat...@soe.ucsc.edu>
Subject failing to detect mime types from custom mimetype.xml
Date Mon, 26 Jan 2009 05:49:35 GMT
I tried asking this on the users' list, but no one responded, so I  
guess I'll try the dev list.

Tika's mime type detection routinely fails on fairly common files.
For instance, Tika returns application/octet-stream rather than
image/gif for every GIF I've tried. Plain text files without an
extension are also marked application/octet-stream instead of text/plain.
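
For comparison, here is a minimal sketch of content-based ("magic") detection written against the JDK only. It is not Tika's implementation: the class name MagicSniffer, the method sniff, and the plain-text heuristic are all assumptions made for illustration. The only outside fact it relies on is that GIF files begin with the ASCII signature GIF87a or GIF89a.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: guesses a type from leading bytes.
final class MagicSniffer {
    private static final byte[] GIF87A = "GIF87a".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] GIF89A = "GIF89a".getBytes(StandardCharsets.US_ASCII);

    static String sniff(byte[] head) {
        // Check the GIF signature first: it is itself printable ASCII.
        if (startsWith(head, GIF87A) || startsWith(head, GIF89A)) {
            return "image/gif";
        }
        if (head.length > 0 && looksLikeText(head)) {
            return "text/plain";
        }
        // Nothing matched, so fall back to the generic binary type.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Crude heuristic (an assumption): printable ASCII plus tab, LF and CR.
    private static boolean looksLikeText(byte[] data) {
        for (byte b : data) {
            boolean whitespace = b == '\t' || b == '\n' || b == '\r';
            boolean printable = b >= 0x20 && b < 0x7f;
            if (!whitespace && !printable) {
                return false;
            }
        }
        return true;
    }
}
```

Feeding the first few bytes of a problem file through a check like this shows whether the file itself is well formed, which helps separate a bad file from a detector that never consulted its magic or glob rules.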

I copied the tika-mimetypes.xml that's included in
tika-0.3-SNAPSHOT.tar and added a glob for image/gif via:

   <mime-type type="image/gif">
     <glob pattern="*.gif" />
   </mime-type>

(It is odd that this has to be added, since the default
tika-config.xml configures a parser for this mime type.)
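
To make the expected behaviour of such a glob concrete, here is a small sketch of glob-to-type lookup using the JDK's own PathMatcher. It is not how Tika resolves globs internally; the class GlobTable, its methods, and the choice to lower-case names before matching are assumptions for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a glob registry; not a Tika class.
final class GlobTable {
    private final Map<PathMatcher, String> globs = new LinkedHashMap<>();

    // Register a pattern such as "*.gif" for a type such as "image/gif".
    void add(String pattern, String type) {
        globs.put(FileSystems.getDefault().getPathMatcher("glob:" + pattern), type);
    }

    // Return the first matching type, or the generic binary type.
    String typeFor(String fileName) {
        // Lower-casing is a design choice here so that PHOTO.GIF also matches.
        Path name = Paths.get(fileName.toLowerCase(Locale.ROOT)).getFileName();
        if (name != null) {
            for (Map.Entry<PathMatcher, String> entry : globs.entrySet()) {
                if (entry.getKey().matches(name)) {
                    return entry.getValue();
                }
            }
        }
        return "application/octet-stream";
    }
}
```

With only the "*.gif" pattern registered, a name like photo.gif resolves to image/gif, while an extensionless name such as README falls through to application/octet-stream. That fall-through is the same symptom described above for plain text files without an extension.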

I loaded the xml file via:

             mimeTypes = MimeTypesFactory.create("/fullpath/tika-mimetypes.xml");
             parser = new AutoDetectParser();
             parser.setMimeTypes(mimeTypes);

but apparently the config file is silently failing to load, or is
being ignored, or AutoDetectParser's mime detector isn't checking the
globs correctly. None of these makes any sense. If any of them were
the case, something should either write to stderr or throw an
exception.

MimeTypes doesn't have a way to list which mime types are
registered, and I can't find a publicly accessible way to
programmatically add a new MimeType to the MimeTypes class.  (You can
add glob patterns, but not actual types.  `new MimeType()` fails
because it's called from outside its package.)
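
As a workaround for the missing listing, the types declared in the XML file can at least be read back directly with the JDK's DOM parser. The sketch below does not touch Tika at all; the class MimeTypeLister and its method are hypothetical, and it assumes only that the file contains `<mime-type type="...">` elements like the one shown earlier. It throws instead of failing silently when the file cannot be parsed or declares no types.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Tika: lists types declared in the XML.
final class MimeTypeLister {
    static List<String> declaredTypes(InputStream xml) {
        NodeList nodes;
        try {
            nodes = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(xml)
                    .getElementsByTagName("mime-type");
        } catch (IOException | SAXException | ParserConfigurationException e) {
            // Fail loudly rather than returning an empty list.
            throw new IllegalStateException("cannot read mime type definitions", e);
        }
        List<String> types = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            types.add(((Element) nodes.item(i)).getAttribute("type"));
        }
        if (types.isEmpty()) {
            throw new IllegalStateException("no <mime-type> elements found");
        }
        return types;
    }
}
```

Passing a FileInputStream for the edited file confirms that the file is well-formed XML and that image/gif is really declared in it, which narrows the problem to how the file is loaded or consulted.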

There is either a bug here, or there's some trick that is completely  
undocumented, or somehow tika-0.3-SNAPSHOT-standalone.jar is  
overriding everything.

I've even tried creating a new tika-config.xml with the full path to my
tika-mimetypes.xml in it:

         try {
             // Load the custom config, then build the parser from it.
             configFile = new File("/fullpath/tika-config.xml");
             config = new TikaConfig(configFile);
             parser = new AutoDetectParser(config);
             contentHandler = getContentHandler();
         } catch (org.xml.sax.SAXException e) {
             System.err.println("can't parse " + e.getMessage());
         } catch (TikaException e) {
             System.err.println("tika exception " + e.getMessage());
         } catch (IOException e) {
             System.err.println("can't read configfile " + e.getMessage());
         }

but that just causes NullPointerExceptions to be thrown.

This is beyond frustrating.

--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/


