tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Error thrown with TikaConfig() constructor
Date Sun, 12 Sep 2010 19:26:04 GMT
Hi Jukka,

> On Sun, Sep 12, 2010 at 5:46 PM, Ken Krugler
> <kkrugler_lists@transpac.com> wrote:
>> But that also seems clunky. Any other suggestions?
> A simpler approach would be to simply pass a list of already
> instantiated Parser objects to AutoDetectParser, like this:
>    public AutoDetectParser(Detector detector, Parser... parsers) {
>        setDetector(detector);
>        Map<MediaType, Parser> map = new HashMap<MediaType, Parser>();
>        ParseContext context = new ParseContext();
>        for (Parser parser : parsers) {
>            for (MediaType type : parser.getSupportedTypes(context)) {
>                map.put(type, parser);
>            }
>        }
>        setParsers(map);
>    }

Thanks for the suggestion. This would work for the current 0.8 code  
base, so I might just go ahead and add that.
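
To make the map-building loop in the quoted constructor concrete, here is a minimal sketch that runs without Tika on the classpath. SketchParser, FixedTypesParser and ParserMapSketch are hypothetical stand-ins I made up for illustration (media types are plain strings here), not Tika's real Parser, MediaType or AutoDetectParser. The one behavior worth noticing is that a parser listed later replaces an earlier one for the same type, because map.put() overwrites.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a parser; NOT Tika's real Parser interface.
interface SketchParser {
    Set<String> getSupportedTypes();
}

// Hypothetical parser that simply reports a fixed set of media type names.
final class FixedTypesParser implements SketchParser {
    private final String name;
    private final Set<String> types;

    FixedTypesParser(String name, String... types) {
        this.name = name;
        this.types = new LinkedHashSet<>(Arrays.asList(types));
    }

    @Override
    public Set<String> getSupportedTypes() {
        return types;
    }

    @Override
    public String toString() {
        return name;
    }
}

final class ParserMapSketch {
    // Same nested loop as the quoted constructor: ask each parser for its
    // supported types and register it under each one. A later parser
    // overwrites an earlier one for the same type.
    static Map<String, SketchParser> buildMap(SketchParser... parsers) {
        Map<String, SketchParser> map = new HashMap<>();
        for (SketchParser parser : parsers) {
            for (String type : parser.getSupportedTypes()) {
                map.put(type, parser);
            }
        }
        return map;
    }

    public static void main(String[] args) {
        SketchParser html = new FixedTypesParser("html", "text/html", "application/xhtml+xml");
        SketchParser text = new FixedTypesParser("text", "text/plain");
        Map<String, SketchParser> map = buildMap(html, text);
        System.out.println(map.get("text/html"));
        System.out.println(map.get("text/plain"));
    }
}
```

The point of this shape is that the caller hands over already-instantiated parsers, so building the map needs no lookup of a default configuration.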

But I found a few other places that called  
TikaConfig.getDefaultConfig() besides AutoDetectParser():
  - Tika()
  - MediaTypeRegistry.getDefaultRegistry()

These don't seem to be used outside of test code, but I could easily  
see people adding calls to them (and getDefaultConfig).

Relying on nothing else in the Tika sub-system ever calling
getDefaultConfig() seems fragile, so a more resilient solution would
be good, especially since this is the second time this problem has
bitten me during a big parse job (20M+ documents).

-- Ken

> BTW, the need to pass a MediaType->Parser map to
> CompositeParser.setParsers() is a remnant of the time when we didn't
> have the Parser.getSupportedTypes() method. Nowadays it would probably
> be better to simply pass a collection of parsers and use
> getSupportedTypes() calls for dispatch during CompositeParser.parse().
>> As an aside, what's the standard use case for specifying an explicit
>> classloader? I haven't seen this used in other projects, so I'm  
>> curious.
> See TIKA-419 [1] for the relevant background.
> [1] https://issues.apache.org/jira/browse/TIKA-419
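
The dispatch-at-parse-time idea quoted above (keep a collection of parsers and consult getSupportedTypes() on each call, rather than building a map up front) might look roughly like the sketch below. Everything here is a hypothetical stand-in written for illustration: TypedHandler, PrefixHandler and DispatchSketch are not Tika classes, media types are plain strings, and "first matching handler wins" is my assumption, not something the thread specifies.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in; NOT Tika's real Parser interface.
interface TypedHandler {
    Set<String> getSupportedTypes();

    String handle(String content);
}

// Hypothetical handler that tags the content with its own name.
final class PrefixHandler implements TypedHandler {
    private final String prefix;
    private final Set<String> types;

    PrefixHandler(String prefix, String... types) {
        this.prefix = prefix;
        this.types = new LinkedHashSet<>(Arrays.asList(types));
    }

    @Override
    public Set<String> getSupportedTypes() {
        return types;
    }

    @Override
    public String handle(String content) {
        return prefix + ":" + content;
    }
}

final class DispatchSketch {
    private final List<TypedHandler> handlers;

    DispatchSketch(TypedHandler... handlers) {
        this.handlers = Arrays.asList(handlers);
    }

    // Pick the handler at call time by asking each one what it supports.
    // Assumption: the first handler that claims the type wins.
    String handle(String mediaType, String content) {
        for (TypedHandler handler : handlers) {
            if (handler.getSupportedTypes().contains(mediaType)) {
                return handler.handle(content);
            }
        }
        throw new IllegalArgumentException("No handler for " + mediaType);
    }

    public static void main(String[] args) {
        DispatchSketch dispatcher = new DispatchSketch(
                new PrefixHandler("html", "text/html"),
                new PrefixHandler("text", "text/plain"));
        System.out.println(dispatcher.handle("text/plain", "hello"));
    }
}
```

The trade-off against the map approach is a linear scan per call in exchange for not having to keep a type-to-parser map in sync with the parser list.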

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g
