tika-dev mailing list archives

From Jan Høydahl (JIRA) <j...@apache.org>
Subject [jira] Updated: (TIKA-527) Allow override mapping mime<-->parsers through config
Date Wed, 03 Nov 2010 01:34:24 GMT

     [ https://issues.apache.org/jira/browse/TIKA-527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated TIKA-527:

    Attachment: TIKA-527.patch

This patch modifies the empty constructor TikaConfig() to try to initialize from XML config:

Creates a default Tika configuration. We first check whether an XML config file is specified,
either in
  1. System property "tika.config", or
  2. Environment variable TIKA_CONFIG
If one of these has a value, we try to resolve it as an absolute or relative file in the
file system, and then as a file on the classpath.
If XML config is not specified, we initialize from the built-in media type rules and all the
Parser implementations available through the service provider mechanism in the context class
loader of the current thread.
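
The lookup order described above can be sketched roughly as follows. This is not the code from the patch, only a minimal illustration under a few assumptions: the class and method names are made up, the system property is assumed to take precedence over the environment variable (the order in which they are listed), and what happens when a location is given but cannot be found is left to the caller.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Map;
import java.util.Properties;

/** Illustrative sketch only; names are hypothetical and not part of Tika. */
public class ConfigLocationSketch {

    /**
     * Returns the configured location, or null if none is set.
     * Assumption: system property "tika.config" is checked before
     * environment variable TIKA_CONFIG.
     */
    public static String configuredLocation(Properties sysProps, Map<String, String> env) {
        String value = sysProps.getProperty("tika.config");
        if (value == null || value.trim().isEmpty()) {
            value = env.get("TIKA_CONFIG");
        }
        return (value == null || value.trim().isEmpty()) ? null : value.trim();
    }

    /**
     * Tries the location as an absolute or relative file first,
     * then as a classpath resource. Returns null if neither exists.
     */
    public static URL resolve(String location, ClassLoader loader) throws MalformedURLException {
        File file = new File(location);
        if (file.isFile()) {
            return file.toURI().toURL();
        }
        return loader.getResource(location);
    }

    public static void main(String[] args) throws Exception {
        String location = configuredLocation(System.getProperties(), System.getenv());
        if (location == null) {
            System.out.println("No XML config specified; built-in defaults would be used");
            return;
        }
        URL url = resolve(location, ConfigLocationSketch.class.getClassLoader());
        System.out.println(url != null
                ? "Would load config from " + url
                : "Config not found: " + location);
    }
}
```

Passing the properties and environment in as parameters keeps the lookup easy to exercise without touching the real process environment.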

This together with DefaultParser and EmptyParser allows for all kinds of custom mappings simply
through configuration.

Example of the system property method using tika-app:

> java -Dtika.config=my-tika-config.xml -jar tika-app-0.8-SNAPSHOT.jar --list-parser-details

Content of my-tika-config.xml:
        <parser class="org.apache.tika.parser.audio.AudioParser"> 

> Allow override mapping mime<-->parsers through config
> -----------------------------------------------------
>                 Key: TIKA-527
>                 URL: https://issues.apache.org/jira/browse/TIKA-527
>             Project: Tika
>          Issue Type: Improvement
>          Components: config
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>         Attachments: TIKA-527.patch
> Background
> -----------------
> As of Tika 0.7, tika-config.xml is no longer mandatory, and loading 3rd party parsers
as plugins through the service architecture is supported.
> This introduces great flexibility, and even allows for extending Tika's file format support
by simply dropping in JARs on the classpath. This is great for configuring Tika when it's
embedded as part of another application such as Solr or Nutch. You can easily add support
for e.g. a commercial document filter with Tika wrapper without changing Tika or the consuming
application, or even maintaining a tika-config.xml.
> This serves the majority of all use cases.
> Problem
> ------------
> However, as the variety of 3rd party document parsers increases, we'll start seeing an
overlap of parsers supporting the same mime-types. A very likely scenario is a company specialized
in document filters packaging their parsers as a Tika plugin, under whatever license they
> In this scenario, a system integrator (working with e.g. Solr) wants to gather all the
parsers that the particular customer needs, and then choose which parser should handle each
mime-type. She may want to let a 3rd party parser plugin handle Word files but have the Tika-supplied
POI parser handle Excel.
> Today, the last parser plugin loaded by the class loader happens to "win" the
mime-types it supports. As it is not uncommon for one parser to register multiple mime-types,
reclaiming a subset of the types is not possible unless you are consuming Tika directly.
> We thus need an "override" mime-to-parser mapping by configuration, and Tika needs to
look for this config by default when starting.
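
To make the "last loaded wins" problem and the proposed override concrete, here is a small self-contained sketch. It does not use Tika's classes; the `Plugin` type, the parser names and the `withOverrides` helper are hypothetical stand-ins, and the real registration logic in Tika may differ in detail.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; none of these types exist in Tika. */
public class ParserOverrideSketch {

    /** Hypothetical stand-in for a parser plugin: a name plus the mime-types it claims. */
    public static final class Plugin {
        final String name;
        final List<String> types;

        public Plugin(String name, String... types) {
            this.name = name;
            this.types = Arrays.asList(types);
        }
    }

    /** Registers plugins in load order; a later plugin overwrites earlier claims. */
    public static Map<String, String> register(List<Plugin> loadOrder) {
        Map<String, String> parserByType = new LinkedHashMap<>();
        for (Plugin plugin : loadOrder) {
            for (String type : plugin.types) {
                parserByType.put(type, plugin.name);
            }
        }
        return parserByType;
    }

    /** Applies explicit mime-to-parser overrides on top of the discovered mapping. */
    public static Map<String, String> withOverrides(Map<String, String> discovered,
                                                    Map<String, String> overrides) {
        Map<String, String> result = new LinkedHashMap<>(discovered);
        result.putAll(overrides);
        return result;
    }

    public static void main(String[] args) {
        String word = "application/msword";
        String excel = "application/vnd.ms-excel";

        // Both plugins claim both types; the third-party plugin is loaded last.
        Map<String, String> discovered = register(Arrays.asList(
                new Plugin("poi-parser", word, excel),
                new Plugin("third-party-parser", word, excel)));
        System.out.println("Discovered: " + discovered);

        // The integrator reclaims Excel for the POI parser through configuration.
        Map<String, String> overrides = new LinkedHashMap<>();
        overrides.put(excel, "poi-parser");
        System.out.println("Effective:  " + withOverrides(discovered, overrides));
    }
}
```

The point of the second step is that the override is per mime-type, so a parser that registers several types can lose one of them without being dropped entirely.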

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
