tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2159) Handle pre-parse embedded object exceptions uniformly and more robustly
Date Tue, 08 Nov 2016 19:33:58 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15648562#comment-15648562 ]

Tim Allison commented on TIKA-2159:

For the general solution, I see two options:

1) Store the stacktrace in the container's metadata, under a key signifying that an exception
occurred while trying to read an embedded stream.
2) Send the exception (along with a zero-length byte stream) through to the embedded parser,
which then has to check whether there has already been an exception.
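
A rough sketch of the two options, in plain Java with no Tika dependency. Everything here is an assumption made for illustration: the metadata is modeled as a Map of multi-valued strings rather than Tika's Metadata class, and the class, method, and key names are hypothetical, not existing Tika API.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stand-in types only; none of these names exist in Tika.
final class EmbeddedStreamFailureSketch {

    // Hypothetical metadata keys, chosen only for this illustration.
    static final String CONTAINER_KEY = "X-SKETCH:embedded_stream_exception";
    static final String EMBEDDED_KEY = "X-SKETCH:pre_parse_exception";

    // Render a throwable's stacktrace as a String.
    static String stackTrace(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw, true));
        return sw.toString();
    }

    // Append a value under a key, keeping earlier values (multi-valued metadata).
    static void add(Map<String, List<String>> metadata, String key, String value) {
        metadata.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Option 1: the container parser records the failure on its own metadata.
    static void recordOnContainer(Map<String, List<String>> containerMetadata, Throwable t) {
        add(containerMetadata, CONTAINER_KEY, stackTrace(t));
    }

    // Option 2: the container parser passes a zero-length stream plus the exception.
    static final class EmbeddedInput {
        final InputStream stream;
        final Throwable preParseException; // null when the stream was obtained normally

        EmbeddedInput(InputStream stream, Throwable preParseException) {
            this.stream = stream;
            this.preParseException = preParseException;
        }
    }

    static EmbeddedInput failedInput(Throwable t) {
        return new EmbeddedInput(new ByteArrayInputStream(new byte[0]), t);
    }

    // The embedded parser checks for an earlier exception before parsing.
    // Returns true if parsing went ahead, false if it was short-circuited.
    static boolean parseEmbedded(EmbeddedInput input, Map<String, List<String>> embeddedMetadata) {
        if (input.preParseException != null) {
            add(embeddedMetadata, EMBEDDED_KEY, stackTrace(input.preParseException));
            return false;
        }
        // Normal parsing of input.stream would happen here.
        return true;
    }
}
```

In this sketch, option 1 leaves the failure on the container's metadata, while option 2 puts it on the embedded document's metadata, which is where a parse failure would also be reported.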

The second might be a bit more work, but it would more closely align what the user sees in
the two cases: a ParseException on an embedded object, and an exception while just trying to
get the stream before the embedded object is parsed.

With the first option, the user would have to check for stacktraces in the embedded docs (as
stored by the RecursiveParserWrapper) _and_ stacktraces stored in the container files.
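
To make the difference on the consumer side concrete, here is a small sketch of how a user might gather stacktraces under each option. It assumes the per-document metadata arrives as a list of multi-valued maps (container first), and that under the second option both kinds of failure end up under a single key on the embedded document; the key names and the class are hypothetical, not Tika API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Consumer-side illustration only; keys and names are made up for this sketch.
final class StacktraceCollectionSketch {

    static final String EMBEDDED_PARSE_KEY = "X-SKETCH:embedded_parse_exception";
    static final String CONTAINER_STREAM_KEY = "X-SKETCH:embedded_stream_exception";

    // Option 1: every document must be checked under two keys, because any
    // document may be both an embedded document and a container.
    static List<String> collectOption1(List<Map<String, List<String>>> docs) {
        List<String> traces = new ArrayList<>();
        for (Map<String, List<String>> doc : docs) {
            traces.addAll(doc.getOrDefault(EMBEDDED_PARSE_KEY, Collections.<String>emptyList()));
            traces.addAll(doc.getOrDefault(CONTAINER_STREAM_KEY, Collections.<String>emptyList()));
        }
        return traces;
    }

    // Option 2 (as assumed here): one key per document covers both cases.
    static List<String> collectOption2(List<Map<String, List<String>>> docs) {
        List<String> traces = new ArrayList<>();
        for (Map<String, List<String>> doc : docs) {
            traces.addAll(doc.getOrDefault(EMBEDDED_PARSE_KEY, Collections.<String>emptyList()));
        }
        return traces;
    }
}
```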

Preferences or other options?

> Handle pre-parse embedded object exceptions uniformly and more robustly
> -----------------------------------------------------------------------
>                 Key: TIKA-2159
>                 URL: https://issues.apache.org/jira/browse/TIKA-2159
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Tim Allison
>            Priority: Minor
> When an embedded document is parsed and causes an exception, we're currently catching
> that and swallowing it in ParsingEmbeddedDocumentExtractor (the default) or reporting it in
> the RecursiveParserWrapper by storing the stacktrace in the Metadata of the embedded document.
> However, if there's an exception during detection on the embedded stream or on getting
> the stream _before_ the stream hits the parser, we aren't handling that uniformly or robustly
> across parsers.

This message was sent by Atlassian JIRA
