tika-dev mailing list archives

From "David Pilato (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2208) Catch missing libraires
Date Sun, 18 Dec 2016 13:50:58 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15758867#comment-15758867 ]

David Pilato commented on TIKA-2208:
------------------------------------

So we now have a regression in the Elasticsearch tests.
We test that the Tika test files are parsed correctly, using a subset
of https://github.com/apache/tika/tree/master/tika-parsers/src/test/resources/test-documents

Before we excluded {{x-tika-ooxml}}, we were able to parse the {{testPPT.potm}} file.
After applying the exclusion, the document comes back empty. Before the change, the following
text was extracted:

{code}
Attachment Test
Rajiv
This is a test file data with the same content as every other file being tested for tika content
parsing. This has been developed by Rajiv Kumar Nistala.
Different words to test against
Quest
Hello
Watershed
Avalanche
Black Panther
Mystery
Banking
Investment
{code}

I think I'm just going to add the missing libraries, as I don't think I can exclude only the
Visio content, right?



> Catch missing libraires
> -----------------------
>
>                 Key: TIKA-2208
>                 URL: https://issues.apache.org/jira/browse/TIKA-2208
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: David Pilato
>
> Hi there
> We have decided to remove support for some formats when using Tika to extract text and
> metadata.
> We defined our list of Parsers:
> {code:java}
>     private static final Parser[] PARSERS = new Parser[] {
>         // documents
>         new org.apache.tika.parser.html.HtmlParser(),
>         new org.apache.tika.parser.rtf.RTFParser(),
>         new org.apache.tika.parser.pdf.PDFParser(),
>         new org.apache.tika.parser.txt.TXTParser(),
>         new org.apache.tika.parser.microsoft.OfficeParser(),
>         new org.apache.tika.parser.microsoft.OldExcelParser(),
>         new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(),
>         new org.apache.tika.parser.odf.OpenDocumentParser(),
>         new org.apache.tika.parser.iwork.IWorkPackageParser(),
>         new org.apache.tika.parser.xml.DcXMLParser(),
>         new org.apache.tika.parser.epub.EpubParser(),
>     };
>     private static final AutoDetectParser PARSER_INSTANCE = new AutoDetectParser(PARSERS);
>     private static final Tika TIKA_INSTANCE =
>         new Tika(PARSER_INSTANCE.getDetector(), PARSER_INSTANCE);
> {code}
> But when an MS Office Word document embeds another unsupported document (like a Visio
> schema), a {{NoClassDefFoundError}} is raised.
> Would it be possible to catch such a case and throw a {{TikaException}} instead, so that
> it surfaces as an Exception rather than an Error?
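
Until something like that exists in Tika itself, a caller-side workaround could look roughly like the sketch below. It is only an illustration of the idea in the issue description: {{ExtractionException}} and {{MissingLibraryGuard}} are hypothetical names standing in for {{TikaException}} and for whatever code wraps the parse call, so the sketch compiles without Tika on the classpath.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for a checked exception such as TikaException,
// so that this sketch compiles without Tika on the classpath.
class ExtractionException extends Exception {
    ExtractionException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical helper: wraps an extraction call and converts missing-class
// errors into a checked exception that ordinary error handling can see.
final class MissingLibraryGuard {
    private MissingLibraryGuard() {}

    static <T> T guard(Callable<T> extraction) throws ExtractionException {
        try {
            return extraction.call();
        } catch (NoClassDefFoundError e) {
            // NoClassDefFoundError is an Error, so `catch (Exception e)` would miss it.
            throw new ExtractionException("Missing optional library: " + e.getMessage(), e);
        } catch (ExtractionException e) {
            // Already the right type: pass it through unchanged.
            throw e;
        } catch (Exception e) {
            throw new ExtractionException("Extraction failed", e);
        }
    }

    public static void main(String[] args) {
        try {
            // Simulates a parser hitting a class whose library was excluded.
            guard(() -> { throw new NoClassDefFoundError("com/example/VisioSupport"); });
        } catch (ExtractionException e) {
            System.out.println("Caught: " + e.getMessage());
        }
    }
}
```

In real code the lambda passed to {{guard}} would be the actual parse call. Note that this only changes how the failure is reported: the embedded document still goes unparsed.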



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
