tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1124) Nested documents not extracted if a PDF file is in the chain
Date Mon, 05 Aug 2013 20:40:48 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729903#comment-13729903 ]

Tim Allison commented on TIKA-1124:

OK, I think I figured this out. AbstractOOXML includes the contents of embedded documents
before calling handler.endDocument().
PDFParser, however, calls handler.endDocument() and then tries to append content from the
embedded documents. I think this means that the parent handler sees an end of body and
therefore does not process the contents of the embedded documents.

Trivial fix: move handler.endDocument() out of PDF2XHTML and call it after processing the
embedded documents in PDFParser.
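
To illustrate the ordering issue, here is a minimal sketch using only the JDK's SAX classes. It is not Tika code: BodyOnlyCollector and both method names are hypothetical, and the collector is only a toy stand-in for a handler that keeps text while inside the body element. It assumes that ending the document closes the body, so anything sent afterwards falls outside it.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class EndDocumentOrderDemo {

    /** Hypothetical toy handler: keeps character data only while inside <body>. */
    static class BodyOnlyCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inBody = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("body".equals(localName)) {
                inBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(localName)) {
                inBody = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text arriving after the body has been closed is silently dropped.
            if (inBody) {
                text.append(ch, start, length);
            }
        }

        String collected() {
            return text.toString();
        }
    }

    private static void emit(DefaultHandler handler, String s) throws SAXException {
        handler.characters(s.toCharArray(), 0, s.length());
    }

    /** Order described for PDFParser: the body is closed, then embedded content is sent. */
    public static String closeBeforeEmbedded() throws SAXException {
        BodyOnlyCollector handler = new BodyOnlyCollector();
        handler.startElement("", "body", "body", new AttributesImpl());
        emit(handler, "[pdf text]");
        handler.endElement("", "body", "body");
        emit(handler, "[embedded text]");
        return handler.collected();
    }

    /** Proposed order: embedded content is sent first, then the body is closed. */
    public static String closeAfterEmbedded() throws SAXException {
        BodyOnlyCollector handler = new BodyOnlyCollector();
        handler.startElement("", "body", "body", new AttributesImpl());
        emit(handler, "[pdf text]");
        emit(handler, "[embedded text]");
        handler.endElement("", "body", "body");
        return handler.collected();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println("close before embedded: " + closeBeforeEmbedded());
        System.out.println("close after embedded:  " + closeAfterEmbedded());
    }
}
```

In this toy model the first ordering loses the embedded text and the second keeps it, which matches the behavior I believe we are seeing.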

Unless I hear otherwise, I'll commit this over the next few days.

> Nested documents not extracted if a PDF file is in the chain
> ------------------------------------------------------------
>                 Key: TIKA-1124
>                 URL: https://issues.apache.org/jira/browse/TIKA-1124
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.3
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: pdf_attachment_issues.zip
> Tika 1.3 is not able to get attachments from the attached PDF.
> The trunk is able to get attachments from the PDF.  However, if that PDF is then embedded
> in another document, the docs embedded in the PDF are not extracted.
> I'm not sure of a solution, but I found two things that might help with the diagnosis:
> 1) If you modify the code in PDFParser so that it doesn't wrap the handler in a BodyContentHandler,
> everything works (in trunk).
> 2) If you modify BodyContentHandler to use my toy SimpleBodyMatchingContentHandler, the
> problem is also solved.
> The cause may be in the MatchingContentHandler.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
