tika-dev mailing list archives

From Tom Grant <tgr...@sms-fed.com>
Subject Re: Appending Mime Types
Date Tue, 23 Aug 2011 19:58:59 GMT
I'm happy to do the work and contribute it as a patch.  I guess I'm just
looking for advice on the approach to ensure that what I provide does
actually get incorporated.  My particular use case is solved by adding the
update methods to MimeTypesFactory (See last message), but I'm in a scenario
where I'm appending new types and not over-writing any of the existing
ones.   The over-writing use case would also bring in a conflict resolution
requirement like the recent addition to the Parser loading logic.

I personally like the approach of loading the standard
/org/apache/tika/mime/tika-mimetypes.xml file, followed by any and all
META-INF/tika-mimetypes.xml resources using the ServiceLoader class,
followed by an optional tika-mimetypes.xml resource from the classpath for
conflict resolution.  This would handle my use case and the over-write one.
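
A minimal sketch of that three-layer load order follows. It is only an illustration, not Tika code: the class LayeredMimeTypes and its methods are hypothetical, each "file" is reduced to a plain map of type name to glob pattern instead of parsed XML, and the classpath scan uses ClassLoader.getResources as a stand-in for whatever ServiceLoader-style discovery would actually be used.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Tika.
final class LayeredMimeTypes {

    /** Finds every copy of a named resource visible to the given class loader. */
    static List<URL> findAll(ClassLoader loader, String resourceName) throws IOException {
        return Collections.list(loader.getResources(resourceName));
    }

    /**
     * Merges definitions in three layers: built-in types, then append-only
     * extensions, then an optional override layer that settles conflicts.
     * Each map is type name -> glob pattern, a simplified stand-in for a
     * parsed tika-mimetypes.xml file.
     */
    static Map<String, String> merge(Map<String, String> builtIn,
                                     List<Map<String, String>> extensions,
                                     Map<String, String> overrides) {
        Map<String, String> merged = new LinkedHashMap<>(builtIn);
        for (Map<String, String> extension : extensions) {
            for (Map.Entry<String, String> e : extension.entrySet()) {
                // Extensions may only append. A clash is an error unless the
                // override layer explicitly resolves it.
                String previous = merged.putIfAbsent(e.getKey(), e.getValue());
                if (previous != null && !previous.equals(e.getValue())
                        && !overrides.containsKey(e.getKey())) {
                    throw new IllegalStateException(
                            "Conflicting definition for " + e.getKey());
                }
            }
        }
        merged.putAll(overrides); // the override layer is applied last and wins
        return merged;
    }

    public static void main(String[] args) throws IOException {
        // Layer 2 discovery: one META-INF/tika-mimetypes.xml per jar, if any.
        List<URL> found = findAll(ClassLoader.getSystemClassLoader(),
                "META-INF/tika-mimetypes.xml");
        System.out.println("Extension files on classpath: " + found.size());

        Map<String, String> merged = merge(
                Map.of("text/plain", "*.txt"),
                List.of(Map.of("application/x-legacy-edi", "*.edi")),
                Map.of());
        System.out.println(merged);
    }
}
```

With an empty override layer this covers the append-only case; supplying an entry in the override map covers the over-write case.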

I still like having the update methods on MimeTypesFactory because in my
particular application we use multiple classloaders to isolate the Parser
implementations and it's easier for me to push the information to Tika than
to have Tika pull the information from my application on startup.
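
As a rough sketch of what a push-style, append-only update could look like without surprising concurrent readers: the class AppendOnlyTypeRegistry below is hypothetical (it is not the update methods from my last message, and not an existing Tika API). It assumes types can be reduced to a name-to-glob map and publishes each update as a new immutable snapshot.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only; this class does not exist in Tika.
final class AppendOnlyTypeRegistry {

    // Readers always see one complete, immutable snapshot.
    private final AtomicReference<Map<String, String>> snapshot;

    AppendOnlyTypeRegistry(Map<String, String> initial) {
        snapshot = new AtomicReference<>(
                Collections.unmodifiableMap(new LinkedHashMap<>(initial)));
    }

    /** Appends new types as one atomic step; refuses to overwrite existing ones. */
    void append(Map<String, String> additions) {
        while (true) {
            Map<String, String> current = snapshot.get();
            Map<String, String> next = new LinkedHashMap<>(current);
            for (Map.Entry<String, String> e : additions.entrySet()) {
                if (next.putIfAbsent(e.getKey(), e.getValue()) != null) {
                    throw new IllegalArgumentException(
                            "Type already defined: " + e.getKey());
                }
            }
            // Publish only if no other thread updated the registry meanwhile;
            // otherwise retry against the newer snapshot.
            if (snapshot.compareAndSet(current, Collections.unmodifiableMap(next))) {
                return;
            }
        }
    }

    /** Returns the current immutable view of all registered types. */
    Map<String, String> types() {
        return snapshot.get();
    }

    public static void main(String[] args) {
        AppendOnlyTypeRegistry registry =
                new AppendOnlyTypeRegistry(Map.of("text/plain", "*.txt"));
        registry.append(Map.of("application/x-legacy-edi", "*.edi"));
        System.out.println(registry.types());
    }
}
```

Each classloader-isolated parser bundle would call append once with its own types; a rejected or failed call leaves the previous snapshot untouched.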


On Tue, Aug 23, 2011 at 7:20 AM, Nick Burch <nick.burch@alfresco.com> wrote:

> On Mon, 22 Aug 2011, Tom Grant wrote:
>> Here's the use case that I'm attempting to solve.  I have a customer with
>> many legacy systems, some of which are completely custom.  These systems
>> have data files that will never be seen outside of their environment.  For
>> example, some are XML files with their own schemas. Some are similar to the
>> new office documents and are zip files containing xml and other goodies.
>> Others are serialized-objects dumped to disk.  Some are similar to EDI with
>> a header and data body with prescribed offsets. The choices of the past
>> can't be undone and I'm stuck with about 30 or 40 different file types.
>
> Ah, so you have non-standard, custom and specific mimetypes that you're
> allocating to these documents. I think we'd tended to think of the mimetypes
> as always being like constants.
>
>> The quantity of file types means that it's going to take a few months to
>> complete and will happen a few at a time.  So I'd like to co-locate the
>> mimetype definition with the parser code for maintainability.
>
> Your best bet is probably to do a custom detector, and have that loaded by
> the service loader the same way that the container aware detector now can
> be. You can put that in your code along with your custom parsers.
>
> I'm not sure what the best way to support this kind of need is. Some
> options that spring to mind are:
> * Loading multiple mimetype files, and merging them like we do for parser
>  class loading
> * Provide another detector that loads custom-mimetypes.xml files from the
>  service loader (so you can have multiple ones) which are used for
>  detection
>
> I guess it depends on if you'd expect to be able to work with the hierarchy
> of the custom extra types or not?
>
> I'm not sure we should be providing ways to add a couple of extra types in at
> a random point in time, as that'll potentially make things behave very
> differently in a multithreaded environment. I'd rather that the extra types
> were loaded once up front, in whichever way is supported.
>
> Nick
