The Parser conflict Jira issue I was referring to is TIKA-527 (Allow
override mapping mime<-->parsers through
config<https://issues.apache.org/jira/browse/TIKA-527>).
We would need something similar for mime types.
The updates to MimeTypesFactory might address TIKA-87 (MimeTypes should
allow modification of MIME
types<https://issues.apache.org/jira/browse/TIKA-87>).
The MimeType class wouldn't be modifiable but the goal of loading mime types
from multiple sources would be achieved.
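
Roughly, the load order from my earlier message (quoted below) could look like the sketch that follows. It is only an illustration under some assumptions: MimeSourceOrder, discover and merge are made-up names rather than existing Tika API, the plain maps stand in for parsed type definitions, and I'm assuming ClassLoader.getResources is what enumerates every META-INF/tika-mimetypes.xml on the classpath.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika.
class MimeSourceOrder {

    // Resource names taken from the proposal; ClassLoader lookups use no leading slash.
    static final String CORE = "org/apache/tika/mime/tika-mimetypes.xml";
    static final String EXTENSION = "META-INF/tika-mimetypes.xml";
    static final String OVERRIDE = "tika-mimetypes.xml";

    /** Lists the definition files to read, in load order: core, extensions, override. */
    static List<URL> discover(ClassLoader loader) throws IOException {
        List<URL> urls = new ArrayList<>();
        URL core = loader.getResource(CORE);
        if (core != null) {
            urls.add(core);
        }
        // Every jar on the classpath may contribute one extension file.
        urls.addAll(Collections.list(loader.getResources(EXTENSION)));
        URL override = loader.getResource(OVERRIDE);
        if (override != null) {
            urls.add(override);
        }
        return urls;
    }

    /**
     * Merges per-source definitions (type name to definition) in list order.
     * A later source replaces an earlier entry with the same type name, and
     * the result is read-only so nothing changes after the up-front load.
     */
    static Map<String, String> merge(List<Map<String, String>> sources) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> source : sources) {
            merged.putAll(source);
        }
        return Collections.unmodifiableMap(merged);
    }
}
```

Appending new types (my case) and over-writing existing ones both fall out of the same rule: whatever is loaded last wins, and the merged result is fixed once loading finishes.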
-Tom
On Tue, Aug 23, 2011 at 3:58 PM, Tom Grant <tgrant@sms-fed.com> wrote:
> Nick,
> I'm happy to do the work and contribute it as a patch. I guess I'm just
> looking for advice on the approach to ensure that what I provide does
> actually get incorporated. My particular use case is solved by adding the
> update methods to MimeTypesFactory (See last message), but I'm in a scenario
> where I'm appending new types and not over-writing any of the existing
> ones. The over-writing use case would also bring in a conflict resolution
> requirement like the recent addition to the Parser loading logic.
>
> I personally like the approach of loading the standard
> /org/apache/tika/mime/tika-mimetypes.xml file, followed by any and all
> META-INF/tika-mimetypes.xml resources using the ServiceLoader class,
> followed by an optional tika-mimetypes.xml resource from the classpath for
> conflict resolution. This would handle my use case and the over-write one.
>
>
> I still like having the update methods on MimeTypesFactory because in my
> particular application we use multiple classloaders to isolate the Parser
> implementations and it's easier for me to push the information to Tika than
> to have Tika pull the information from my application on startup.
>
> -Tom
>
>
>
> On Tue, Aug 23, 2011 at 7:20 AM, Nick Burch <nick.burch@alfresco.com> wrote:
>
>> On Mon, 22 Aug 2011, Tom Grant wrote:
>>
>>> Here's the use case that I'm attempting to solve. I have a customer with
>>> many legacy systems, some of which are completely custom. These systems
>>> have data files that will never be seen outside of their environment. For
>>> example, some are XML files with their own schemas. Some are similar to the
>>> new office documents and are zip files containing xml and other goodies.
>>> Others are serialized-objects dumped to disk. Some are similar to EDI with
>>> a header and data body with prescribed offsets. The choices of the past
>>> can't be undone and I'm stuck with about 30 or 40 different file types.
>>>
>>
>> Ah, so you have non standard, custom and specific mimetypes that you're
>> allocating to these documents. I think we'd tended to think of the mimetypes
>> as always being like constants.
>>
>>
>>> The quantity of file types means that it's going to take a few months to
>>> complete and will happen a few at a time. So I'd like to co-locate the
>>> mimetype definition with the parser code for maintainability.
>>>
>>
>> Your best bet is probably to do a custom detector, and have that loaded by
>> the service loader the same way that the container aware detector now can
>> be. You can put that in your code along with your custom parsers.
>>
>>
>> I'm not sure what the best way to support this kind of need is. Some
>> options that spring to mind are:
>> * Loading multiple mimetype files, and merging them like we do for parser
>> class loading
>> * Provide another detector that loads custom-mimetypes.xml files from the
>> service loader (so you can have multiple ones) which are used for
>> detection
>>
>> I guess it depends on if you'd expect to be able to work with the
>> hierarchy of the custom extra types or not?
>>
>> I'm not sure we should be providing ways to add a couple of extra types in
>> at a random point in time, as that'll potentially make things behave very
>> differently in a multithreaded environment. I'd rather that the extra types
>> were loaded once up front, in whichever way is supported.
>>
>> Nick
>>
>
>
>