tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2208) Catch missing libraires
Date Fri, 16 Dec 2016 14:21:59 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15754538#comment-15754538 ]

Tim Allison commented on TIKA-2208:
-----------------------------------

So, I think that should be your solution for now, unless [~gagravarr] can think of any unintended
consequences, or unless that is too broad for your use case, [~dadoonet].

However, there are two potential issues that we may want to address:

1) Even with the full Tika with all of its dependencies, I'm getting this:
{noformat}
Caused by: java.lang.ClassNotFoundException: com.microsoft.schemas.office.visio.x2012.main.ConnectsType
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
{noformat}
I think this means that we should add this test file to POI so that the appropriate classes
are included in our slimmed-down ooxml-schemas...right, Nick?

2) I found it annoying that we have to turn off the full super-type "x-tika-ooxml" when we
might want to turn off only one subtype, e.g. "x-tika-visio-ooxml", or one sub-subtype, e.g.
"vnd.ms-visio.drawing".  In other words, when I tried to exclude only "vnd.ms-visio.drawing", our
exclusion mechanism didn't work.  Do we want to fix this?
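
To make the behaviour asked about in 2) concrete, here is a rough sketch of an exclusion check that walks up the type hierarchy, so that excluding a type covers that type and its descendants but leaves siblings alone. This is not Tika's actual exclusion code: the class {{MimeExclusionSketch}} and its methods are hypothetical, the type names are abbreviated as in the comment above, and "example-word-ooxml" is a made-up sibling used only for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of hierarchy-aware exclusion; not Tika's implementation. */
final class MimeExclusionSketch {
    // Maps each type to its declared supertype (child -> parent).
    private final Map<String, String> superTypes = new HashMap<>();
    // Types that were excluded explicitly.
    private final Set<String> excluded = new HashSet<>();

    void declare(String type, String superType) {
        superTypes.put(type, superType);
    }

    void exclude(String type) {
        excluded.add(type);
    }

    /** True if the type itself or any of its ancestors was excluded. */
    boolean isExcluded(String type) {
        Set<String> seen = new HashSet<>(); // guards against cycles in the map
        for (String t = type; t != null && seen.add(t); t = superTypes.get(t)) {
            if (excluded.contains(t)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        MimeExclusionSketch types = new MimeExclusionSketch();
        types.declare("x-tika-visio-ooxml", "x-tika-ooxml");
        types.declare("vnd.ms-visio.drawing", "x-tika-visio-ooxml");
        types.declare("example-word-ooxml", "x-tika-ooxml"); // made-up sibling

        // Exclude only the most specific Visio type.
        types.exclude("vnd.ms-visio.drawing");

        System.out.println(types.isExcluded("vnd.ms-visio.drawing")); // true
        System.out.println(types.isExcluded("example-word-ooxml"));   // false
        System.out.println(types.isExcluded("x-tika-ooxml"));         // false
    }
}
```

With a check like this, excluding "x-tika-visio-ooxml" instead would also cover "vnd.ms-visio.drawing", while the rest of "x-tika-ooxml" would stay enabled.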

> Catch missing libraires
> -----------------------
>
>                 Key: TIKA-2208
>                 URL: https://issues.apache.org/jira/browse/TIKA-2208
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: David Pilato
>
> Hi there
> We have decided to remove support for some formats when using Tika to extract text and metadata.
> We defined our list of Parsers:
> {code:java}
>     private static final Parser[] PARSERS = new Parser[] {
>         // documents
>         new org.apache.tika.parser.html.HtmlParser(),
>         new org.apache.tika.parser.rtf.RTFParser(),
>         new org.apache.tika.parser.pdf.PDFParser(),
>         new org.apache.tika.parser.txt.TXTParser(),
>         new org.apache.tika.parser.microsoft.OfficeParser(),
>         new org.apache.tika.parser.microsoft.OldExcelParser(),
>         new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(),
>         new org.apache.tika.parser.odf.OpenDocumentParser(),
>         new org.apache.tika.parser.iwork.IWorkPackageParser(),
>         new org.apache.tika.parser.xml.DcXMLParser(),
>         new org.apache.tika.parser.epub.EpubParser(),
>     };
>     private static final AutoDetectParser PARSER_INSTANCE = new AutoDetectParser(PARSERS);
>     private static final Tika TIKA_INSTANCE = new Tika(PARSER_INSTANCE.getDetector(),
>             PARSER_INSTANCE);
> {code}
> But when an MS Office Word document embeds another unsupported document (like a Visio schema), a {{NoClassDefFoundError}} is raised.
> Would it be possible to catch such a case and throw a {{TikaException}} instead, so that it behaves as an Exception and not as a Throwable?
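
For illustration only, the wrapping requested in the issue description could look roughly like the sketch below. It does not use Tika itself: {{ParseFailedException}} is a stand-in for a checked exception such as {{TikaException}}, and {{LinkageGuardSketch}}, {{ParseStep}} and {{parseGuarded}} are hypothetical names. Where such a catch would belong inside Tika, and whether catching an {{Error}} there is desirable, is left open.

```java
/** Stand-in for a checked parse exception such as TikaException (hypothetical). */
class ParseFailedException extends Exception {
    ParseFailedException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical guard that turns a missing-class error into a checked exception. */
final class LinkageGuardSketch {

    /** A unit of parsing work that may hit a class missing from the classpath. */
    interface ParseStep {
        String run() throws ParseFailedException;
    }

    static String parseGuarded(ParseStep step) throws ParseFailedException {
        try {
            return step.run();
        } catch (NoClassDefFoundError e) {
            // NoClassDefFoundError is an Error, so callers that only catch
            // Exception never see it. Wrap it and keep the original as the cause.
            throw new ParseFailedException(
                    "Missing optional dependency: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        try {
            // Simulate a parser touching a class that is not on the classpath.
            parseGuarded(() -> {
                throw new NoClassDefFoundError(
                        "com/microsoft/schemas/office/visio/x2012/main/ConnectsType");
            });
        } catch (ParseFailedException e) {
            System.out.println("Caught: " + e.getMessage());
        }
    }
}
```

The caller then handles one checked exception type and can still reach the original error through {{getCause()}}.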



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
