tika-dev mailing list archives

From Tyler Palsulich <tpalsul...@gmail.com>
Subject Re: Configuring parsers and translators
Date Sat, 06 Jun 2015 22:59:21 GMT
(Devil's advocate hat slightly on.) My one hesitation about putting it all
into tika-config is that the default config might grow into a monstrosity
that is difficult for new users to work with.

Tyler

On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> I think it would be great to have all this in the Tika Config.
>
> The one thing then is to provide an example default config and
> to make it *hugely* clear, rather than keep all the levels of indirection
> that we currently have, which make it super hard to track down
> a config error (SPI, swallowed print messages, etc.)
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
> -----Original Message-----
> From: Tyler Palsulich <tpalsulich@gmail.com>
> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> Date: Saturday, June 6, 2015 at 3:45 PM
> To: "dev@tika.apache.org" <dev@tika.apache.org>
> Subject: Re: Configuring parsers and translators
>
> >Hi Nick,
> >
> >I've been mulling this over since you sent the first message. But, I'm
> >afraid I don't have a good solution or developed ideas.
> >
> >I agree, it would be very nice to consolidate all configuration for all
> >parsers in the server and app.
> >
> >Is it feasible to put everything into tika-config? Then Parser
> >implementations would read the config to pull out their own configuration.
> >Or, would it be better to keep some configuration separate? Documentation
> >would be an issue if every parser defines its own metadata keys... But, it
> >might be an improvement since we don't have "free form" properties and
> >configuration files.
> >
> >Tyler
> >
> >On Sat, Jun 6, 2015 at 12:30 PM Nick Burch <apache@gagravarr.org> wrote:
> >
> >> Anyone have any thoughts on this?
> >>
> >> On Fri, 8 May 2015, Nick Burch wrote:
> >> > Hi All
> >> >
> >> > This came up in TIKA-1623, but I thought it might be better brought
> >> > out to the list for discussion
> >> >
> >> > To configure parsers on a per-document basis, such as setting PDF
> >> > spacing tolerances, or telling Tesseract what language it should be
> >> > OCRing for, we have the *Config objects. You create one of these, use
> >> > the setters to configure it for your document, pop it onto the Parse
> >> > context and it's used when processing your document
> >> >
> >> > To configure parsers and translators on a per-JVM basis, to apply to
> >> > all documents processed, it's a bit less consistent. At least some look
> >> > for a properties file with a specific name, usually in the tika namespace,
> >> > and grab their settings / keys / etc out of that. At least some expect
> >> > to find a *Config with their program path on it, even though that
> >> > remains constant between documents. None of them support getting their
> >> > settings from the Tika Config
> >> >
> >> >
> >> > As part of our evolution of parser preferences, we're moving towards
> >> > people either being able to set their preferences in code, or being
> >> > able to supply a Tika Config xml which sets their parser preferences or
> >> > overrides certain bits of the default. The code option works for people
> >> > who want to declare certain specific things, the Tika Config one gives
> >> > the same functionality but allows a consistent and clean way to set it
> >> > between Tika App, Tika Server and java code.
> >> >
> >> > Another related example is the External Parser support. Because you
> >> > can have multiple External Parser instances in your setup, one per
> >> > format / program, we look for all the
> >> > org/apache/tika/parser/external/tika-external-parsers.xml files on the
> >> > classpath, and create parser instances based on definitions in there
> >> >
> >> >
> >> > What do we think about setting executable paths and keys/logins for
> >> > parsers like OCR, Strings, Translators etc? Always on ParseContext?
> >> > Properties? Custom xml config? Tika config xml? Other? Combination?
> >> >
> >> > Nick
> >> >
> >>
>
>
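
The two configuration levels Nick describes in the quoted message (a *Config object placed on the parse context for one document, versus a per-JVM default read from a properties file) can be sketched roughly as below. This is only an illustration in plain Java, not Tika code: the Context and OcrConfig classes, the resolve method, and the "ocr.language" key with its "eng" fallback are all hypothetical stand-ins invented for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class ConfigLookupSketch {

    /** Hypothetical stand-in for a per-document context: one instance per config class. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast returns null when nothing was stored for this key.
            return key.cast(entries.get(key));
        }
    }

    /** Hypothetical config object with a setter, in the style of the *Config classes. */
    static final class OcrConfig {
        private String language;

        OcrConfig(String language) {
            this.language = language;
        }

        String getLanguage() {
            return language;
        }

        void setLanguage(String language) {
            this.language = language;
        }
    }

    /**
     * Assumed lookup order: a config on the per-document context wins;
     * otherwise fall back to the per-JVM defaults (here a Properties object).
     */
    static OcrConfig resolve(Context context, Properties jvmDefaults) {
        OcrConfig perDocument = context.get(OcrConfig.class);
        if (perDocument != null) {
            return perDocument;
        }
        // "ocr.language" and the "eng" fallback are made-up names for this sketch.
        return new OcrConfig(jvmDefaults.getProperty("ocr.language", "eng"));
    }

    public static void main(String[] args) {
        Properties jvmDefaults = new Properties();
        jvmDefaults.setProperty("ocr.language", "deu");

        // No per-document config: the per-JVM default applies.
        Context plain = new Context();
        System.out.println(resolve(plain, jvmDefaults).getLanguage());

        // Per-document override placed on the context for this one document.
        Context overridden = new Context();
        overridden.set(OcrConfig.class, new OcrConfig("fra"));
        System.out.println(resolve(overridden, jvmDefaults).getLanguage());
    }
}
```

Under this assumed ordering, settings that stay constant between documents (executable paths, keys, logins) would live only in the per-JVM source, while the context carries just what changes from one document to the next.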
