tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2208) Catch missing libraires
Date Fri, 16 Dec 2016 14:54:58 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15754611#comment-15754611 ]

Tim Allison commented on TIKA-2208:
-----------------------------------

bq. I'm saying it's better because previously we were rejecting the document entirely.

Really?  Wow. 

I'll go ahead with 1) above and add the test file to POI some time today so that the extra
classes at least from this test file will be included in the next release of POI's ooxml-schemas.

If you have any opinion on 2), let us know.  [~gagravarr], your thoughts?

Finally, I'm sorry I was slow to respond to this issue.  Thank you, Nick, for the ping.  It was
nice working with you and ES!

Cheers!

> Catch missing libraires
> -----------------------
>
>                 Key: TIKA-2208
>                 URL: https://issues.apache.org/jira/browse/TIKA-2208
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: David Pilato
>
> Hi there
> We have decided to remove support for some formats when using Tika to extract text and metadata.
> We defined our list of Parsers:
> {code:java}
>     private static final Parser[] PARSERS = new Parser[] {
>         // documents
>         new org.apache.tika.parser.html.HtmlParser(),
>         new org.apache.tika.parser.rtf.RTFParser(),
>         new org.apache.tika.parser.pdf.PDFParser(),
>         new org.apache.tika.parser.txt.TXTParser(),
>         new org.apache.tika.parser.microsoft.OfficeParser(),
>         new org.apache.tika.parser.microsoft.OldExcelParser(),
>         new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(),
>         new org.apache.tika.parser.odf.OpenDocumentParser(),
>         new org.apache.tika.parser.iwork.IWorkPackageParser(),
>         new org.apache.tika.parser.xml.DcXMLParser(),
>         new org.apache.tika.parser.epub.EpubParser(),
>     };
>     private static final AutoDetectParser PARSER_INSTANCE = new AutoDetectParser(PARSERS);
>     private static final Tika TIKA_INSTANCE = new Tika(PARSER_INSTANCE.getDetector(), PARSER_INSTANCE);
> {code}
> But when an MS Office Word document embeds an unsupported document (such as a Visio schema), a {{NoClassDefFoundError}} is raised.
> Would it be possible to catch that case and throw a {{TikaException}} instead, so that it behaves as an Exception and not as a Throwable?
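
For illustration, the conversion requested in the issue description could look roughly like the sketch below. It is not Tika's actual fix. `MissingLibraryGuard`, `guard`, and `ExtractionException` are hypothetical names, and `ExtractionException` stands in for `TikaException` so that the example compiles without Tika on the classpath.

```java
import java.util.concurrent.Callable;

public class MissingLibraryGuard {

    /** Hypothetical checked exception; a stand-in for TikaException in this sketch. */
    public static class ExtractionException extends Exception {
        public ExtractionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs an extraction task and converts a NoClassDefFoundError, which is an
     * Error and therefore escapes a plain catch (Exception e), into a checked
     * exception that callers can handle like any other extraction failure.
     */
    public static <T> T guard(Callable<T> extraction) throws ExtractionException {
        try {
            return extraction.call();
        } catch (NoClassDefFoundError e) {
            // A class needed by an optional parser is not on the classpath.
            throw new ExtractionException("Missing optional library: " + e.getMessage(), e);
        } catch (ExtractionException e) {
            // Already the right type; pass it through unchanged.
            throw e;
        } catch (Exception e) {
            // Any other checked or runtime failure from the task.
            throw new ExtractionException("Extraction failed", e);
        }
    }
}
```

A caller would wrap the parse call in `guard(...)` and catch only `ExtractionException`. The original `NoClassDefFoundError` stays available through `getCause()`, so the name of the missing class is not lost.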



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
