tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1228) Embedded files not extracted properly from PDF
Date Mon, 03 Feb 2014 18:08:12 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889697#comment-13889697 ]

Tim Allison commented on TIKA-1228:
-----------------------------------

I won't have time to fix this for a week or so, but it looks like the client (Tika) needs
to look through the kids of embeddedFiles recursively (well, in this file, just one level
down) to get the non-null embeddedFileNames.

Something like this does pull out the .doc file:

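{no-format}
// Names stored directly on the root node; null when the root only has Kids.
Map<String, COSObjectable> embeddedFileNames = embeddedFiles.getNames();
// Also look one level down: each kid can carry its own Names map.
List<PDNameTreeNode> kids = embeddedFiles.getKids();
if (kids != null) {
    for (PDNameTreeNode n : kids) {
        // Distinct variable name; redeclaring embeddedFileNames here would not compile.
        Map<String, COSObjectable> kidNames = n.getNames();
        processEmbedded(kidNames, embeddedExtractor);
....
{no-format}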

where processEmbedded is shorthand for the existing code:
{no-format}
if (embeddedFileNames != null){
...
}
{no-format}
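
The fully recursive version of the lookup described above can be sketched in isolation. This is only an illustration, not a patch: NameNode is a hypothetical stand-in for a name-tree node (it is not PDFBox's PDNameTreeNode), its values are plain strings rather than file specifications, and the file names in main are made up. The only assumption carried over from the real API is that both getNames() and getKids() may return null.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a name-tree node; NOT the PDFBox class.
class NameNode {
    private final Map<String, String> names; // null when the node has no Names entry
    private final List<NameNode> kids;       // null when the node has no Kids entry

    NameNode(Map<String, String> names, List<NameNode> kids) {
        this.names = names;
        this.kids = kids;
    }

    Map<String, String> getNames() { return names; }
    List<NameNode> getKids() { return kids; }
}

class NameTreeWalk {
    // Collects the name -> value pairs of a node and of all its descendants.
    static Map<String, String> collectNames(NameNode node) {
        Map<String, String> out = new LinkedHashMap<>();
        collect(node, out);
        return out;
    }

    private static void collect(NameNode node, Map<String, String> out) {
        if (node == null) {
            return;
        }
        Map<String, String> names = node.getNames();
        if (names != null) {
            out.putAll(names);
        }
        List<NameNode> kids = node.getKids();
        if (kids != null) {
            for (NameNode kid : kids) {
                collect(kid, out); // recurse into every level, not just the first
            }
        }
    }

    public static void main(String[] args) {
        // Root with no Names and a single kid holding the entries
        // (file names here are made up for the example).
        Map<String, String> leafNames = new LinkedHashMap<>();
        leafNames.put("example.doc", "spec for example.doc");
        leafNames.put("example.txt", "spec for example.txt");
        List<NameNode> kids = new ArrayList<>();
        kids.add(new NameNode(leafNames, null));
        NameNode root = new NameNode(null, kids);

        System.out.println(collectNames(root).keySet());
    }
}
```

Collecting into a single map keeps the existing null check in one place; the real code would call processEmbedded per node instead of merging.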

We can fix this at the Tika level in the short term.  I'm not sure whether this is the expected
behavior in PDFBox.  At the least, we might want to request that this line in the javadoc for
PDDocumentNameDictionary, "The value in this name tree will be PDComplexFileSpecification
objects.", be changed to "The value in this name tree or its children will be PDComplexFileSpecification
objects."

> Embedded files not extracted properly from PDF
> ----------------------------------------------
>
>                 Key: TIKA-1228
>                 URL: https://issues.apache.org/jira/browse/TIKA-1228
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>         Environment: CentOS 6.5 VM
>            Reporter: Jason Sherman
>              Labels: easyfix
>         Attachments: pdf_with_doc_and_text_attached.pdf
>
>
> IAW pdfbox example here:
> http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
> the PDF parser does not check for additional entries under Kids node when Names node
> does not exist.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
