tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: Configuring parsers and translators
Date Mon, 08 Jun 2015 14:24:41 GMT
Tyler, I see your devil's advocate point.  

I strongly agree with Chris about the benefit of centralizing configuration and making it
easy to dump and modify the TikaConfig file.

Even though the TikaConfig file might get ugly, it would be far better to have everything
nailed down there than searching through service loaders...IMHO.

I opened TIKA-1508 a while ago and haven't had any time to work on it...this just deals with
simple parameter settings for parsers, not the far more difficult/interesting stuff that we've
discussed with composite parsers.

>> My main worry with putting it all into config xml is that we accidentally end up re-inventing
>> Spring badly...

Yeah, or re-inventing Solr's parameter loading as my example does... :(

I think that basic parameter setting should at least be fairly trivial to code...time allowing...argh.
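
To make "basic parameter setting" concrete, here is a rough sketch of reading per-parser parameters out of a config file. It is only an illustration, not what TIKA-1508 specifies: the <param> element, the parser class name, and the ParserParamSketch/readParams names are all assumptions made up for this example, and it uses only the JDK's XML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: the <param> layout below is an assumption, not a Tika format.
class ParserParamSketch {

    // Returns parser class name -> (param name -> param value), in document order.
    static Map<String, Map<String, String>> readParams(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            Map<String, String> params = new LinkedHashMap<>();
            // Collect every <param name="...">value</param> under this parser.
            NodeList paramNodes = parser.getElementsByTagName("param");
            for (int j = 0; j < paramNodes.getLength(); j++) {
                Element param = (Element) paramNodes.item(j);
                params.put(param.getAttribute("name"), param.getTextContent().trim());
            }
            result.put(parser.getAttribute("class"), params);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Made-up config; the parser class and parameter name are placeholders.
        String xml = "<properties><parsers>"
                + "<parser class=\"org.example.HypotheticalPdfParser\">"
                + "<param name=\"sortByPosition\">true</param>"
                + "</parser></parsers></properties>";
        System.out.println(readParams(xml));
    }
}
```

A parser could then look up its own class name in the returned map and apply the values through its setters, which would keep everything visible in one file rather than spread across service loaders.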


-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Saturday, June 06, 2015 7:01 PM
To: dev@tika.apache.org
Subject: Re: Configuring parsers and translators

Hey Tyler,

I hear you, but balance that against all the hidden things here
and there, and everywhere, that I constantly keep discovering and
having to pore through lines of TikaConfig - service loaders, class
loaders.

When things work right - no problem. When something goes wrong;
HUGE waste of time.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




-----Original Message-----
From: Tyler Palsulich <tpalsulich@gmail.com>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Saturday, June 6, 2015 at 3:59 PM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Subject: Re: Configuring parsers and translators

>(Devil's advocate hat slightly on.) My one hesitation about putting it all
>into tika-config is that the default might get to be a monstrosity --
>difficult for new users to use.
>
>Tyler
>
>On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) <
>chris.a.mattmann@jpl.nasa.gov> wrote:
>
>> I think it would be great to have all this in the Tika Config.
>>
>> The one thing then is to provide an example default config and
>> to make it *hugely* clear rather than all the levels of indirection
>> that we currently have going on which makes it super hard when
>> there is a config error (SPI, swallowing print messages, etc.)
>>
>>
>> -----Original Message-----
>> From: Tyler Palsulich <tpalsulich@gmail.com>
>> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
>> Date: Saturday, June 6, 2015 at 3:45 PM
>> To: "dev@tika.apache.org" <dev@tika.apache.org>
>> Subject: Re: Configuring parsers and translators
>>
>> >Hi Nick,
>> >
>> >I've been mulling this over since you sent the first message. But, I'm
>> >afraid I don't have a good solution or developed ideas.
>> >
>> >I agree, it would be very nice to consolidate all configuration for all
>> >parsers in the server and app.
>> >
>> >Is it feasible to put everything into tika-config? Then Parser
>> >implementations would read the config to pull out their own configuration.
>> >Or, would it be better to keep some configuration separate? Documentation
>> >would be an issue if every parser defines its own metadata keys... But, it
>> >might be an improvement since we don't have "free form" properties and
>> >configuration files.
>> >
>> >Tyler
>> >
>> >On Sat, Jun 6, 2015 at 12:30 PM Nick Burch <apache@gagravarr.org> wrote:
>> >
>> >> Anyone have any thoughts on this?
>> >>
>> >> On Fri, 8 May 2015, Nick Burch wrote:
>> >> > Hi All
>> >> >
>> >> > This came up in TIKA-1623, but I thought it might be better brought
>> >> > out to the list for discussion
>> >> >
>> >> > To configure parsers on a per-document basis, such as setting PDF
>> >> > spacing tolerances, or telling Tesseract what language it should be
>> >> > OCRing for, we have the *Config objects. You create one of these,
>> >> > use the setters to configure it for your document, pop it onto the
>> >> > Parse context and it's used when processing your document
>> >> >
>> >> > To configure parsers and translators on a per-JVM basis, to apply
>> >> > to all documents processed, it's a bit less consistent. At least
>> >> > some look for a properties file with a specific name, usually in
>> >> > the tika namespace, and grab their settings / keys / etc out of
>> >> > that. At least some expect to find a *Config with their program
>> >> > path on it, even though that remains constant between documents.
>> >> > None of them support getting their settings from the Tika Config
>> >> >
>> >> >
>> >> > As part of our evolution of parser preferences, we're moving
>> >> > towards people either being able to set their preferences in code,
>> >> > or being able to supply a Tika Config xml which sets their parser
>> >> > preferences or overrides certain bits of the default. The code
>> >> > option works for people who want to declare certain specific
>> >> > things, the Tika Config one gives the same functionality but allows
>> >> > a consistent and clean way to set it between Tika App, Tika Server
>> >> > and java code.
>> >> >
>> >> > Another related example is the External Parser support. Because you
>> >> > can have multiple External Parser instances in your setup, one per
>> >> > format / program, we look for all the
>> >> > org/apache/tika/parser/external/tika-external-parsers.xml files on
>> >> > the classpath, and create parser instances based on definitions in
>> >> > there
>> >> >
>> >> >
>> >> > What do we think about setting executable paths and keys/logins for
>> >> > parsers like OCR, Strings, Translators etc? Always on ParseContext?
>> >> > Properties? Custom xml config? Tika config xml? Other? Combination?
>> >> >
>> >> > Nick
>> >> >
>> >>
>>
>>
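
For reference, the per-document pattern Nick describes at the bottom of the thread (create a config object, call its setters, put it on the parse context, let the parser pick it up) can be sketched roughly as below. This is a self-contained illustration, not Tika code: SketchContext is a stand-in written for this example, and OcrConfigSketch, languageFor and the "eng"/"fra" values are hypothetical names chosen only to show the lookup-with-default idea.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "config object on a context" pattern; not Tika classes.
class PerDocumentConfigSketch {

    // Stand-in for a parse context: stores at most one object per type.
    static class SketchContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        // Returns the stored object for this type, or the given default if none was set.
        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    // Hypothetical per-document settings for an OCR parser.
    static class OcrConfigSketch {
        private String language = "eng";

        String getLanguage() {
            return language;
        }

        void setLanguage(String language) {
            this.language = language;
        }
    }

    // What a parser would do: use the caller's config if present, else its defaults.
    static String languageFor(SketchContext context) {
        return context.get(OcrConfigSketch.class, new OcrConfigSketch()).getLanguage();
    }

    public static void main(String[] args) {
        OcrConfigSketch config = new OcrConfigSketch();
        config.setLanguage("fra");

        SketchContext context = new SketchContext();
        context.set(OcrConfigSketch.class, config);

        System.out.println(languageFor(context));             // caller-supplied config
        System.out.println(languageFor(new SketchContext())); // parser defaults
    }
}
```

The open question in the thread is where the defaults in that last line should come from for per-JVM settings: today it may be a properties file or a *Config, and the proposal is to have the Tika Config supply them.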
