tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Tika configuration (Was: Using URL's for Input Resource Specifiers: How can I help?)
Date Fri, 21 Sep 2007 07:19:01 GMT

On 9/20/07, Chris Mattmann <chris.mattmann@jpl.nasa.gov> wrote:
> 1. How do we configure Tika? This somewhat relates to the prior discussion
> on the Tika parser interface, but it extends beyond that.
> [...]
> 1. We define a TikaConfiguration object, and an xml file location/format
> within CM for Tika configuration properties. [...]

I'd like to keep the configuration part structurally separate from the
parser and other components, so that one could easily integrate Tika
components in various different environments like IoC containers, etc.

We could have a "native" Tika configuration file and a simple
mechanism that converts the configuration to active parser (and other)
instances, but it should be possible to use the Tika features even
without such a configuration.

My preference would be to use the JavaBean conventions for any
configuration options on parser and other Tika classes to avoid extra
dependencies on custom configuration objects (see also TIKA-23). A
native configuration mechanism could use the property setters just
like a generic IoC container or even a hardcoded client application.
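
To illustrate the JavaBean idea, here is a minimal sketch rather than a
proposed implementation. ExampleTextParser, its encoding and maxLength
properties, and the BeanConfigSketch.apply helper are all hypothetical
names made up for this example; only the java.beans classes are real.
It shows a parser that knows nothing about configuration objects, set
up once through a generic property-driven mechanism and once through
plain setter calls.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.lang.reflect.Method;
import java.util.Map;

final class BeanConfigSketch {

    // Hypothetical parser with plain JavaBean properties; not an actual Tika class.
    public static class ExampleTextParser {
        private String encoding = "UTF-8";
        private int maxLength = 1000;

        public String getEncoding() { return encoding; }
        public void setEncoding(String encoding) { this.encoding = encoding; }
        public int getMaxLength() { return maxLength; }
        public void setMaxLength(int maxLength) { this.maxLength = maxLength; }
    }

    // Hypothetical "native" mechanism: pushes string values through matching setters.
    // Keys that do not match any bean property are silently ignored.
    static void apply(Object bean, Map<String, String> properties) {
        try {
            BeanInfo info = Introspector.getBeanInfo(bean.getClass());
            for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
                String raw = properties.get(pd.getName());
                Method setter = pd.getWriteMethod();
                if (raw == null || setter == null) {
                    continue; // not configured, or a read-only property
                }
                setter.invoke(bean, convert(raw, pd.getPropertyType()));
            }
        } catch (IntrospectionException | ReflectiveOperationException e) {
            throw new IllegalStateException("Could not configure " + bean.getClass(), e);
        }
    }

    // Converts a configuration string to the setter's parameter type.
    private static Object convert(String raw, Class<?> type) {
        if (type == String.class) return raw;
        if (type == int.class || type == Integer.class) return Integer.valueOf(raw);
        if (type == boolean.class || type == Boolean.class) return Boolean.valueOf(raw);
        throw new IllegalArgumentException("Unsupported property type: " + type);
    }

    public static void main(String[] args) {
        // Configured through the generic mechanism, e.g. from a parsed config file.
        ExampleTextParser viaConfig = new ExampleTextParser();
        apply(viaConfig, Map.of("encoding", "ISO-8859-1", "maxLength", "500"));

        // Configured by a hardcoded client; the parser cannot tell the difference.
        ExampleTextParser hardcoded = new ExampleTextParser();
        hardcoded.setEncoding("ISO-8859-1");
        hardcoded.setMaxLength(500);

        System.out.println(viaConfig.getEncoding() + " " + viaConfig.getMaxLength());
        System.out.println(hardcoded.getEncoding() + " " + hardcoded.getMaxLength());
    }
}
```

An IoC container would do roughly what apply does here, which is why
the parser class itself needs no dependency on either mechanism.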

>  2. What are the right data attributes to configure a parser? Could we get
> some documentation on them? [...]
> [...]
>  2. We sit down and baseline a set of properties, including documentation on
> them, for tika parsers.

I think the configuration of a parser class will be highly dependent
on the content format it handles, so there may not be that many truly
global configuration options. I even think that the mime type should
be a part of content metadata and not of parser configuration.

However, I very much agree with the drive to plan and document the
available configuration options.

> We should also change everything in CM right now that says "Luis" to "Tika".


> 3. What are the entry points into Tika? As far as I can tell, there is a
> ParserFactory that can be used to get a Parser for a particular file or Url,
> etc. This implies that the ParserFactory performs some sort of mime type
> resolution (which it does), however, mime type resolution (using the new
> mime framework) requires the ability for Tika to have a configuration.

I think we should try to keep Tika as modular as possible and have
multiple different entry points depending on the set of functionality
and amount of customization a client wants.

Currently I could foresee Tika being composed of three independent
components (parsing, mime type detection, configuration) and a helper
layer that binds these three together. It should be possible (and
easy) for a client to reach directly to even a single parser class and
use just that, but also to invoke a single helper method that looks up
a configuration file, instantiates a set of parser and type detection
components, retrieves a resource identified by a URI, and extracts the
text content of the resource using all the configured components.
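
As a rough sketch of that layering (every type and method name below
is hypothetical and invented for illustration, none of it is existing
Tika API), the three components and the helper could look something
like this. To keep the example self-contained, the configuration is
built in code by Configuration.defaults() instead of being read from a
file, and type detection is a naive file extension check.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

final class ModularSketch {

    // Component 1: parsing. Usable directly, with no configuration at all.
    interface TextParser {
        String parse(InputStream in) throws IOException;
    }

    public static class PlainTextParser implements TextParser {
        @Override
        public String parse(InputStream in) throws IOException {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Component 2: type detection (assumption: a naive extension check).
    interface TypeDetector {
        String detect(URI resource);
    }

    static class ExtensionDetector implements TypeDetector {
        @Override
        public String detect(URI resource) {
            String path = resource.getPath();
            return path != null && path.endsWith(".txt")
                    ? "text/plain" : "application/octet-stream";
        }
    }

    // Component 3: configuration. Only wires components together;
    // a real one might be loaded from a file instead of built in code.
    static class Configuration {
        final TypeDetector detector;
        final Map<String, TextParser> parsers = new HashMap<>();

        Configuration(TypeDetector detector) {
            this.detector = detector;
        }

        static Configuration defaults() {
            Configuration config = new Configuration(new ExtensionDetector());
            config.parsers.put("text/plain", new PlainTextParser());
            return config;
        }
    }

    // Helper layer: one call that binds configuration, detection and parsing.
    static String extractText(URI resource) throws IOException {
        Configuration config = Configuration.defaults();
        String type = config.detector.detect(resource);
        TextParser parser = config.parsers.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser configured for " + type);
        }
        try (InputStream in = resource.toURL().openStream()) {
            return parser.parse(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Entry point A: reach directly for a single parser class.
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new PlainTextParser().parse(in));

        // Entry point B: the one-call helper, given only a URI.
        Path file = Files.createTempFile("sketch", ".txt");
        Files.writeString(file, "hello from a file");
        System.out.println(extractText(file.toUri()));
        Files.delete(file);
    }
}
```

The point of the split is that PlainTextParser never sees the detector
or the configuration, so a client that only wants entry point A pulls
in nothing else.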


Jukka Zitting
