tika-dev mailing list archives

From "Tyler Palsulich (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (TIKA-669) Backup plan for parsing
Date Sun, 01 Mar 2015 22:21:04 GMT

     [ https://issues.apache.org/jira/browse/TIKA-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich closed TIKA-669.
    Resolution: Duplicate

> Backup plan for parsing
> -----------------------
>                 Key: TIKA-669
>                 URL: https://issues.apache.org/jira/browse/TIKA-669
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
> Currently, once a document type has been detected, we direct the document to the one parser
> that best matches the detected type. In practice there are cases where that parser finds that
> it in fact cannot parse this document, for example when something that looked like XML turns
> out to have syntax errors. For such cases it would be nice if the CompositeParser could then
> retry parsing the document with a more generic backup parser, like the plain text parser for
> malformed XML.
> Implementing this would require some level of buffering and redirection of both parser
> input and output. Input buffering is easy, but for output buffering we'd probably need to
> implement new ContentHandler and Metadata layers.
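
The buffering idea in the description above could look roughly like the following. This is a minimal, self-contained sketch and not Tika code: the SimpleParser interface, the XML and PLAIN_TEXT instances, and parseWithFallback are hypothetical names made up for illustration, and a StringBuilder stands in for the ContentHandler and Metadata layers the issue mentions. It assumes Java 9 or later (for InputStream.readAllBytes) and documents small enough to hold in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class FallbackParsingSketch {

    /** Hypothetical stand-in for a parser; this is NOT Tika's Parser interface. */
    interface SimpleParser {
        void parse(InputStream in, StringBuilder out) throws Exception;
    }

    /** Strict XML parser: collects character data, throws on malformed input. */
    static final SimpleParser XML = (in, out) -> {
        SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        });
    };

    /** Generic backup parser: treats the whole input as UTF-8 text. */
    static final SimpleParser PLAIN_TEXT = (in, out) ->
            out.append(new String(in.readAllBytes(), StandardCharsets.UTF_8));

    /** Tries each parser in order and returns the output of the first that succeeds. */
    static String parseWithFallback(InputStream in, List<SimpleParser> parsers)
            throws IOException {
        // Input buffering: read once so every attempt sees the same bytes.
        byte[] buffered = in.readAllBytes();
        Exception last = null;
        for (SimpleParser parser : parsers) {
            // Output buffering: a fresh buffer per attempt, so partial output
            // from a failed parser never reaches the caller.
            StringBuilder out = new StringBuilder();
            try {
                parser.parse(new ByteArrayInputStream(buffered), out);
                return out.toString();
            } catch (Exception e) {
                last = e; // remember the failure and try the next parser
            }
        }
        throw new IOException("All parsers failed", last);
    }

    public static void main(String[] args) throws IOException {
        byte[] malformed = "<a><b>hello</a>".getBytes(StandardCharsets.UTF_8);
        System.out.println(parseWithFallback(
                new ByteArrayInputStream(malformed), List.of(XML, PLAIN_TEXT)));
    }
}
```

With the malformed input in main, the XML attempt emits "hello" into its private buffer before failing on the mismatched end tag; that partial output is discarded and the plain text parser's result is returned instead. A real implementation would presumably need to buffer SAX events and metadata rather than a string, which is the harder part the issue points out.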

This message was sent by Atlassian JIRA
