tika-dev mailing list archives

From Tom Grant <tgr...@sms-fed.com>
Subject Re: Appending Mime Types
Date Wed, 24 Aug 2011 01:26:45 GMT
The Parser conflict Jira issue I was referring to is TIKA-527 (Allow
override mapping mime<-->parsers through
config<https://issues.apache.org/jira/browse/TIKA-527>).
We would need something similar for mime types.

The updates to MimeTypesFactory might address TIKA-87 (MimeTypes should
allow modification of MIME
types<https://issues.apache.org/jira/browse/TIKA-87>).
The MimeType class wouldn't be modifiable but the goal of loading mime types
from multiple sources would be achieved.
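
A minimal sketch of the load order proposed in the quoted message below (core
file, then every META-INF/tika-mimetypes.xml, then an optional override), with
later sources winning on conflict. It is not Tika code: the class and method
names are hypothetical, it uses ClassLoader.getResources rather than
ServiceLoader to enumerate the files, and it stands in a plain
type-name-to-definition map for the parsed XML.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika.
public class MimeSourceOrder {

    // Resource names come from the thread; treating them as fixed constants is an assumption.
    static final String CORE = "org/apache/tika/mime/tika-mimetypes.xml";
    static final String EXTENSIONS = "META-INF/tika-mimetypes.xml";
    static final String OVERRIDE = "tika-mimetypes.xml";

    /** Lists definition files in load order: core, every extension, then the optional override. */
    static List<URL> discover(ClassLoader loader) throws IOException {
        List<URL> sources = new ArrayList<>();
        URL core = loader.getResource(CORE);
        if (core != null) {
            sources.add(core);
        }
        // getResources returns every matching file visible to this class loader.
        Enumeration<URL> extensions = loader.getResources(EXTENSIONS);
        while (extensions.hasMoreElements()) {
            sources.add(extensions.nextElement());
        }
        URL override = loader.getResource(OVERRIDE);
        if (override != null) {
            sources.add(override);
        }
        return sources;
    }

    /** Merges per-source maps (type name to definition); later sources win on conflict. */
    static Map<String, String> merge(List<Map<String, String>> perSource) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> source : perSource) {
            merged.putAll(source);
        }
        // Built once and then read-only, so readers never see a half-updated registry.
        return Collections.unmodifiableMap(merged);
    }
}
```

Returning an unmodifiable map reflects the preference for loading the extra
types once up front; a push-style update method would need its own
synchronization on top of this.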

-Tom

On Tue, Aug 23, 2011 at 3:58 PM, Tom Grant <tgrant@sms-fed.com> wrote:

> Nick,
> I'm happy to do the work and contribute it as a patch.  I guess I'm just
> looking for advice on the approach to ensure that what I provide does
> actually get incorporated.  My particular use case is solved by adding the
> update methods to MimeTypesFactory (See last message), but I'm in a scenario
> where I'm appending new types and not over-writing any of the existing
> ones.   The over-writing use case would also bring in a conflict resolution
> requirement like the recent addition to the Parser loading logic.
>
> I personally like the approach of loading the standard
> /org/apache/tika/mime/tika-mimetypes.xml file, followed by any and all
> META-INF/tika-mimetypes.xml resources using the ServiceLoader class,
> followed by an optional tika-mimetypes.xml resource from the classpath for
> conflict resolution.  This would handle my use case and the over-write one.
>
>
> I still like having the update methods on MimeTypesFactory because in my
> particular application we use multiple classloaders to isolate the Parser
> implementations and it's easier for me to push the information to Tika than
> to have Tika pull the information from my application on startup.
>
> -Tom
>
>
>
> On Tue, Aug 23, 2011 at 7:20 AM, Nick Burch <nick.burch@alfresco.com> wrote:
>
>> On Mon, 22 Aug 2011, Tom Grant wrote:
>>
>>> Here's the use case that I'm attempting to solve.  I have a customer with
>>> many legacy systems, some of which are completely custom.  These systems
>>> have data files that will never be seen outside of their environment.  For
>>> example, some are XML files with their own schemas. Some are similar to the
>>> new office documents and are zip files containing xml and other goodies.
>>> Others are serialized-objects dumped to disk.  Some are similar to EDI with
>>> a header and data body with prescribed offsets. The choices of the past
>>> can't be undone and I'm stuck with about 30 or 40 different file types.
>>>
>>
>> Ah, so you have non-standard, custom and specific mimetypes that you're
>> allocating to these documents. I think we'd tended to think of the mimetypes
>> as always being like constants.
>>
>>
>>> The quantity of file types means that it's going to take a few months to
>>> complete and will happen a few at a time.  So I'd like to co-locate the
>>> mimetype definition with the parser code for maintainability.
>>>
>>
>> Your best bet is probably to do a custom detector, and have that loaded by
>> the service loader the same way that the container aware detector now can
>> be. You can put that in your code along with your custom parsers
>>
>>
>> I'm not sure what the best way to support this kind of need is. Some
>> options that spring to mind are:
>> * Loading multiple mimetype files, and merging them like we do for parser
>>  class loading
>> * Provide another detector that loads custom-mimetypes.xml files from the
>>  service loader (so you can have multiple ones) which are used for
>>  detection
>>
>> I guess it depends on if you'd expect to be able to work with the
>> hierarchy of the custom extra types or not?
>>
>> I'm not sure we should be providing ways to add a couple of extra types in
>> at a random point in time, as that'll potentially make things behave very
>> differently in a multithreaded environment. I'd rather that the extra types
>> were loaded once up front, in whichever way is supported.
>>
>> Nick
>>
>
>
>
