tika-dev mailing list archives

From Tom Grant <tgr...@sms-fed.com>
Subject Re: Appending Mime Types
Date Wed, 24 Aug 2011 01:26:45 GMT
The Parser conflict Jira issue I was referring to is TIKA-527 (Allow
override mapping mime<-->parsers through
We would need something similar for mime types.

The updates to MimeTypesFactory might address TIKA-87 (MimeTypes should
allow modification of MIME
The MimeType class wouldn't be modifiable, but the goal of loading mime types
from multiple sources would be achieved.
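
For illustration, here is a rough sketch of the load order proposed in the
quoted message below: the standard file first, then every
META-INF/tika-mimetypes.xml on the classpath, then an optional
tika-mimetypes.xml for conflict resolution. It is not existing Tika code.
The class name MimeTypeSources and the collect method are made up for this
example. Because java.util.ServiceLoader locates provider classes rather
than arbitrary resource files, the sketch swaps in ClassLoader.getResources
to find the per-jar files.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Tika: lists mime type definition files
// in the order they would be parsed.
final class MimeTypeSources {
    // Resource names as given in the thread (no leading slash for ClassLoader lookups).
    static final String CORE = "org/apache/tika/mime/tika-mimetypes.xml";
    static final String EXTENSIONS = "META-INF/tika-mimetypes.xml";
    static final String OVERRIDE = "tika-mimetypes.xml";

    static List<URL> collect(ClassLoader loader) throws IOException {
        List<URL> ordered = new ArrayList<>();

        // 1. The standard definitions, if present on this classpath.
        URL core = loader.getResource(CORE);
        if (core != null) {
            ordered.add(core);
        }

        // 2. Every extension file; each jar may contribute one.
        ordered.addAll(Collections.list(loader.getResources(EXTENSIONS)));

        // 3. An optional application-level file, parsed last so it can settle conflicts.
        URL override = loader.getResource(OVERRIDE);
        if (override != null) {
            ordered.add(override);
        }
        return ordered;
    }

    public static void main(String[] args) throws IOException {
        for (URL url : collect(MimeTypeSources.class.getClassLoader())) {
            System.out.println(url);
        }
    }
}
```

Each URL in the returned list would then be handed to whatever parses the
XML, in list order, so that later files can add to or override earlier ones.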


On Tue, Aug 23, 2011 at 3:58 PM, Tom Grant <tgrant@sms-fed.com> wrote:

> Nick,
> I'm happy to do the work and contribute it as a patch.  I guess I'm just
> looking for advice on the approach to ensure that what I provide does
> actually get incorporated.  My particular use case is solved by adding the
> update methods to MimeTypesFactory (See last message), but I'm in a scenario
> where I'm appending new types and not over-writing any of the existing
> ones.   The over-writing use case would also bring in a conflict resolution
> requirement like the recent addition to the Parser loading logic.
> I personally like the approach of loading the standard
> /org/apache/tika/mime/tika-mimetypes.xml file, followed by any and all
> META-INF/tika-mimetypes.xml resources using the ServiceLoader class,
> followed by an optional tika-mimetypes.xml resource from the classpath for
> conflict resolution.  This would handle my use case and the over-write one.
> I still like having the update methods on MimeTypesFactory because in my
> particular application we use multiple classloaders to isolate the Parser
> implementations and it's easier for me to push the information to Tika than
> to have Tika pull the information from my application on startup.
> -Tom
> On Tue, Aug 23, 2011 at 7:20 AM, Nick Burch <nick.burch@alfresco.com> wrote:
>> On Mon, 22 Aug 2011, Tom Grant wrote:
>>> Here's the use case that I'm attempting to solve.  I have a customer with
>>> many legacy systems, some of which are completely custom.  These systems
>>> have data files that will never be seen outside of their environment.  For
>>> example, some are XML files with their own schemas. Some are similar to the
>>> new office documents and are zip files containing xml and other goodies.
>>> Others are serialized-objects dumped to disk.  Some are similar to EDI with
>>> a header and data body with prescribed offsets. The choices of the past
>>> can't be undone and I'm stuck with about 30 or 40 different file types.
>> Ah, so you have non standard, custom and specific mimetypes that you're
>> allocating to these documents. I think we'd tended to think of the mimetypes
>> as always being like constants
>>> The quantity of file types means that it's going to take a few months to
>>> complete and will happen a few at a time.  So I'd like to co-locate the
>>> mimetype definition with the parser code for maintainability.
>> Your best bet is probably to do a custom detector, and have that loaded by
>> the service loader the same way that the container aware detector now can
>> be. You can put that in your code along with your custom parsers
>> I'm not sure what the best way to support this kind of need is. Some
>> options that spring to mind are:
>> * Loading multiple mimetype files, and merging them like we do for parser
>>  class loading
>> * Provide another detector that loads custom-mimetypes.xml files from the
>>  service loader (so you can have multiple ones) which are used for
>>  detection
>> I guess it depends on if you'd expect to be able to work with the
>> hierarchy of the custom extra types or not?
>> I'm not sure we should be providing ways to add a couple of extra types in
>> at a random point in time, as that'll potentially make things behave very
>> differently in a multithreaded environment. I'd rather that the extra types
>> were loaded once up front, in whichever way is supported.
>> Nick
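
To make the "load once up front" and conflict resolution points concrete,
here is a minimal sketch under stated assumptions. It models each source as
just a name plus the list of type names it defines, merges them in load
order with later sources winning, records every overwrite, and then exposes
only read-only views so nothing changes after startup. MergedTypes, merge,
sourceOf and conflicts are hypothetical names, and application/x-legacy-edi
is an invented type; none of this is Tika API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry: built once, read-only afterwards.
final class MergedTypes {
    private final Map<String, String> sourceByType;
    private final List<String> conflicts;

    private MergedTypes(Map<String, String> sourceByType, List<String> conflicts) {
        this.sourceByType = Collections.unmodifiableMap(sourceByType);
        this.conflicts = Collections.unmodifiableList(conflicts);
    }

    // sources maps a source name to the type names it defines, in load order.
    static MergedTypes merge(Map<String, List<String>> sources) {
        Map<String, String> merged = new LinkedHashMap<>();
        List<String> conflicts = new ArrayList<>();
        for (Map.Entry<String, List<String>> source : sources.entrySet()) {
            for (String type : source.getValue()) {
                // A later source replaces an earlier one; remember that it happened.
                String previous = merged.put(type, source.getKey());
                if (previous != null) {
                    conflicts.add(type + ": " + previous + " -> " + source.getKey());
                }
            }
        }
        return new MergedTypes(merged, conflicts);
    }

    String sourceOf(String type) {
        return sourceByType.get(type);
    }

    List<String> conflicts() {
        return conflicts;
    }

    public static void main(String[] args) {
        Map<String, List<String>> sources = new LinkedHashMap<>();
        sources.put("core", Arrays.asList("text/plain", "application/xml"));
        sources.put("legacy-parsers", Arrays.asList("application/x-legacy-edi"));
        sources.put("override", Arrays.asList("application/xml"));

        MergedTypes types = merge(sources);
        System.out.println(types.sourceOf("application/xml")); // prints: override
        System.out.println(types.conflicts()); // prints: [application/xml: core -> override]
    }
}
```

The append-only use case never produces a conflict entry, while the
over-write use case shows up in the conflicts list, where a loader could
log it or reject it depending on the policy chosen.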
