tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Error thrown with TikaConfig() constructor
Date Sat, 11 Sep 2010 20:17:55 GMT
> On Fri, Sep 10, 2010 at 10:31 PM, Nick Burch <nick.burch@alfresco.com> wrote:
>> Quite a lot of OfficeParser does depend on poifs code though, as well as a
>> few bits that depend on some of the less common POI text extractors.
> It looks like a number of our other new parsers also have direct
> dependencies to external libraries, so this problem is not just
> related to the OfficeParser class.
> The basic problem here is that the service loader used by the default
> TikaConfig constructor throws an exception when it can't load a class
> listed in a org.apache.tika.parser.Parser service file. The solution I
> implemented in TIKA-378 for the 0.7 release was to move the external
> parser library references to separate extractor classes so that the
> parser class could be instantiated without problems. Unfortunately
> this was a one-off solution that obviously hasn't survived further
> development in the svn trunk.
> The reason why I originally didn't want to simply catch and ignore the
> potential exceptions in the TikaConfig constructor was the lack of a
> good error reporting mechanism. The trick of insulating the external
> library dependencies to separate extractor classes nicely solved that
> problem by delaying the exceptions to the actual parse() method calls
> on specific document types, which obviously would then give the end
> user a much better idea of what's wrong.
> Perhaps the best solution would actually be to combine the above
> approaches, i.e. to strive to maintain the parser/extractor separation
> where possible and to use a catch block in the TikaConfig constructor
> to catch and ignore any problems that the insulation approach fails to
> address.

IIRC, the main concern with this approach arises when people use
custom parsers, where instantiation exceptions can be caused by bugs
in the parser itself (rather than by explicitly excluded jars). Quietly
ignoring these errors defers the failure until parse time, which can be
a bad thing.

What I would propose is two changes:

1. Add a new TikaConfig(ClassLoader, Class<Parser>...) constructor
that can be used to instantiate all parsers from the variable list
that are found using the ClassLoader. For example:

     public TikaConfig(ClassLoader loader, Class<Parser>... targetParsers)
             throws MimeTypeException, IOException {
         for (Class<Parser> parserClass : targetParsers) {
             ParseContext context = new ParseContext();

             try {
                 // Fail fast: a parser that cannot be instantiated aborts
                 // construction instead of being silently skipped.
                 Parser parser = parserClass.newInstance();
                 for (MediaType type : parser.getSupportedTypes(context)) {
                     parsers.put(type, parser);
                 }
             } catch (InstantiationException e) {
                 throw new IOException(e);
             } catch (IllegalAccessException e) {
                 throw new IOException(e);
             }
         }

         mimeTypes = MimeTypesFactory.create("tika-mimetypes.xml");
     }
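
To make the fail-fast behaviour concrete, here is a minimal, self-contained sketch that runs without Tika on the classpath. Everything in it is a stand-in: SketchParser, TextSketchParser, BrokenSketchParser and ParserRegistrySketch are hypothetical names, and media types are plain strings. It only illustrates that a parser whose constructor throws is reported at construction time rather than being skipped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for a parser interface; NOT org.apache.tika.parser.Parser.
interface SketchParser {
    String[] supportedTypes();
}

// A well-behaved stand-in parser.
class TextSketchParser implements SketchParser {
    public String[] supportedTypes() {
        return new String[] {"text/plain"};
    }
}

// Simulates a custom parser with a bug in its constructor.
class BrokenSketchParser implements SketchParser {
    BrokenSketchParser() {
        throw new IllegalStateException("bug in custom parser");
    }

    public String[] supportedTypes() {
        return new String[0];
    }
}

// Hypothetical registry that instantiates exactly the classes it is given.
class ParserRegistrySketch {
    final Map<String, SketchParser> parsers = new LinkedHashMap<>();

    @SafeVarargs
    ParserRegistrySketch(Class<? extends SketchParser>... targetParsers) {
        for (Class<? extends SketchParser> parserClass : targetParsers) {
            try {
                SketchParser parser =
                        parserClass.getDeclaredConstructor().newInstance();
                for (String type : parser.supportedTypes()) {
                    parsers.put(type, parser);
                }
            } catch (ReflectiveOperationException e) {
                // Surface the problem immediately instead of ignoring it.
                throw new IllegalStateException(
                        "Cannot instantiate " + parserClass.getName(), e);
            }
        }
    }
}
```

Constructing the registry with TextSketchParser.class alone succeeds; adding BrokenSketchParser.class to the list makes the constructor throw right away, with the original exception attached as the cause.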

2. Add a TikaConfig.setDefaultConfig() static method, so that callers  
can set the default config that might get used in various places.
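
A rough sketch of how such a setter could sit next to the existing getter is shown below. ConfigSketch is a hypothetical stand-in for TikaConfig, and the fallback of creating a new instance on every call is only an assumption modelled on the current getDefaultConfig() comment.

```java
// Stand-in for TikaConfig; only the default-handling logic is sketched.
class ConfigSketch {
    // Null until a caller installs a default; volatile so that a default
    // set on one thread is visible to readers on other threads.
    private static volatile ConfigSketch defaultConfig;

    final String name;

    ConfigSketch(String name) {
        this.name = name;
    }

    // Hypothetical counterpart of the proposed setDefaultConfig().
    static void setDefaultConfig(ConfigSketch config) {
        defaultConfig = config;
    }

    // Returns the caller-supplied default if one was set, otherwise a
    // fresh instance on each call.
    static ConfigSketch getDefaultConfig() {
        ConfigSketch current = defaultConfig;
        return current != null ? current : new ConfigSketch("built-in");
    }
}
```

Sharing one instance this way is only safe if the configuration is not modified after it has been installed, which is where the immutability question comes in.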

One question here is that the current TikaConfig.getDefaultConfig()  
method has this comment:

      * Provides a default configuration (TikaConfig).  Currently creates a
      * new instance each time it's called; we may be able to have it
      * return a shared instance once it is completely immutable.

Any insight into this comment? I see that it was based on https://issues.apache.org/jira/browse/TIKA-34

From what I can tell, making TikaConfig immutable would require
wrapping the parsers map in an unmodifiable map, and somewhat more
serious modifications to MediaTypes (registry, types, magics, xmls,
patterns) to be able to create an immutable version of that.
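
For the parsers map, the wrapping part could look roughly like the sketch below. ImmutableParsersSketch and the String/Object key and value types are placeholders, not Tika's real types; the point is only the copy-then-wrap pattern with java.util.Collections.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder showing how a parsers map could be made read-only.
class ImmutableParsersSketch {
    private final Map<String, Object> parsers;

    ImmutableParsersSketch(Map<String, Object> source) {
        // Copy first so later changes to the caller's map cannot leak in,
        // then wrap so that callers cannot modify the copy either.
        this.parsers = Collections.unmodifiableMap(new HashMap<>(source));
    }

    Map<String, Object> getParsers() {
        return parsers;
    }
}
```

Any attempt to call put() or remove() on the returned map throws UnsupportedOperationException. Note that this only protects the map itself; the parser objects stored in it would still need to be immutable on their own.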

The above changes would let me instantiate the TikaConfig that I need,
without having to dup/edit/keep in sync any XML files, and make sure
that all of the Tika code base uses this particular configuration.


-- Ken

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g
