tika-dev mailing list archives

From Antoni Mylka <antoni.my...@gmail.com>
Subject Re: Appending Mime Types
Date Fri, 19 Aug 2011 08:59:02 GMT

:) I had an identical workaround for a while: http://bit.ly/p2Dqc4. Mine 
was named "MimeTypesEnhancer" but we're essentially reinventing the same 
wheel. Later we switched to encapsulating the Tika detector in our own 
class, which stores a DOM Document, performs all CRUD operations on the 
DOM and uses the only "accepted" method: 
MimeTypesFactory.create(Document). Details in Aperture's 
"TikaMimeTypeIdentifier" class: http://bit.ly/pJjftV.
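
To illustrate the DOM-based approach described above, here is a minimal sketch of the merge step only, using nothing but the JDK's XML APIs. The class name MimeInfoMerger and its methods are hypothetical and are not taken from Aperture or Tika. The sketch assumes that mime-type elements are direct descriptions keyed by their "type" attribute, and that an incoming entry whose type already exists should replace the old one.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: keeps mime-info definitions in a DOM and merges additions into it. */
public class MimeInfoMerger {

    /** Parses a mime-info XML string into a DOM Document. */
    public static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse mime-info XML", e);
        }
    }

    /**
     * Copies every mime-type element from extra into base. An element whose
     * "type" attribute already exists in base replaces the existing element
     * (override); any other element is appended to the root.
     */
    public static void merge(Document base, Document extra) {
        Element root = base.getDocumentElement();
        NodeList incoming = extra.getDocumentElement().getElementsByTagName("mime-type");
        for (int i = 0; i < incoming.getLength(); i++) {
            Element candidate = (Element) incoming.item(i);
            // importNode makes a deep copy owned by the base document.
            Node imported = base.importNode(candidate, true);
            Element existing = findByType(root, candidate.getAttribute("type"));
            if (existing != null) {
                existing.getParentNode().replaceChild(imported, existing);
            } else {
                root.appendChild(imported);
            }
        }
    }

    /** Returns the mime-type element with the given type attribute, or null if absent. */
    static Element findByType(Element root, String type) {
        NodeList current = root.getElementsByTagName("mime-type");
        for (int i = 0; i < current.getLength(); i++) {
            Element element = (Element) current.item(i);
            if (type.equals(element.getAttribute("type"))) {
                return element;
            }
        }
        return null;
    }
}
```

The merged Document would then be handed to MimeTypesFactory.create(Document), the call mentioned above, to build the MimeTypes registry. That step is left out here so the sketch has no Tika dependency.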

I thought I was the only one who'd like to access the MimeTypes 
definition database programmatically. That makes two of us.

Antoni Myłka

On 2011-08-19 00:04, Tom Grant wrote:
> Is there a way to programmatically register new Mime Types?  We have a way
> to plug in new parsers, but I do not see a way to define new file types.
> I'd like to be able to contribute both the Mime Type definitions as well as
> the Parser implementations that parse them in a single plugin Jar file.  The
> code to update Mime Types exists in org.apache.tika.mime.MimeTypesReader but
> that class is package scope.  I would like it to be public, or provide
> another class like the one attached that exposes its functionality.  The key
> is that I want to keep the standard Mime Types and just append or override a
> few of my own.  I currently append to the Mime Types using:
> MimeTypes types = _tikaConfig.getMimeRepository();
> MimeTypesAppender appender = new MimeTypesAppender(types);
> appender.append(mimeDoc);
> I realize that I can copy the tika-mimetypes.xml file and add my own types,
> but that requires me to maintain one master file and update it every time
> someone on my team adds or removes a parser. I then run the risk of getting
> out of sync with the one distributed with Tika. I think a better
> approach might be to add another META-INF/ file that contains the extra mime
> types that should be loaded by Tika.
> org.apache.tika.config.ServiceLoader.findServiceResources hints at this
> approach but it doesn't appear to be in place.  MimeTypes
> getDefaultMimeTypes() just loads a single file.
> -Tom
> package org.apache.tika.mime;
> import java.io.IOException;
> import java.io.InputStream;
> import org.w3c.dom.Document;
> /**
>   * Works around the fact that the MimeTypesReader class is package-private.
>   */
> public class MimeTypesAppender {
>      private final MimeTypesReader _reader;
>      public MimeTypesAppender(MimeTypes types) {
>          this._reader = new MimeTypesReader(types);
>      }
>      public void append(Document doc) throws MimeTypeException {
>          _reader.read(doc);
>      }
>      public void append(InputStream is) throws MimeTypeException, IOException {
>          _reader.read(is);
>      }
> }
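
The META-INF idea in the quoted message could be sketched roughly as follows: each plugin jar ships its own definitions file under a well-known resource name, and the loader enumerates every copy visible on the classpath. The resource name META-INF/custom-mimetypes.xml and the class ExtraMimeTypesLocator are hypothetical; nothing here is an existing Tika API, and the sketch only locates the files without parsing them.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: finds every extra mime-types file contributed by jars on the classpath. */
public class ExtraMimeTypesLocator {

    /** Assumed resource name; a real convention would have to be agreed on by the project. */
    public static final String EXTRA_TYPES_RESOURCE = "META-INF/custom-mimetypes.xml";

    /**
     * Returns the URLs of all resources with the given name that the class loader can see.
     * Each jar or directory on the classpath may contribute at most one entry.
     */
    public static List<URL> findAll(ClassLoader loader, String resourceName) throws IOException {
        return Collections.list(loader.getResources(resourceName));
    }
}
```

Each URL returned could then be opened and appended to the standard definitions, for example through the append(InputStream) method of the quoted MimeTypesAppender.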
