tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-704) PDF and Outlook docs embedded in MS Word documents not parsed
Date Fri, 09 Sep 2011 09:27:09 GMT

    [ https://issues.apache.org/jira/browse/TIKA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101087#comment-13101087 ]

Jukka Zitting commented on TIKA-704:

Hmm, there was still a hidden copy of the Yamaha manual in the test file. I removed that in
revision 1167056, which also brought down the size of the file from 3.9MB to a more comfortable

> PDF and Outlook docs embedded in MS Word documents not parsed
> -------------------------------------------------------------
>                 Key: TIKA-704
>                 URL: https://issues.apache.org/jira/browse/TIKA-704
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>         Environment: Windows 7 64-bit
>            Reporter: Jeremy Anderson
>            Assignee: Jukka Zitting
>             Fix For: 1.0
>         Attachments: LicensedTestWithOutlook.docx, LicensedTestWithPdf.docx, TestWithOutlook.docx,
TestWithPdf.docx, recursiveUsage.txt
> Currently there appear to be issues with embedded PDFs and Outlook MSG files contained
in MS Word documents. I'll attach a sample for each and my recursive parser (in case the problem
lies in there).
> From what I see, when these embedded objects are parsed, they're initially identified
as vnd.openxmlformats-officedocument.oleObject in the metadata's Content-Type field. After
a call to the recursive parser's superclass parse method, the Content-Types update to the following:
> PDFs: application/vnd.ms-works
> .MSG: application/x-tika-msoffice
> The internal AutoDetectParser is unable to properly identify these PDFs and therefore
does not call the PDFParser on them.
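
One plausible reading of the symptom above, assuming the embedded part is stored inside an OLE2 wrapper: a detector that only looks at the leading bytes sees the container's signature and never reaches the PDF header. The sketch below illustrates that with plain magic-byte sniffing. It is not Tika's detection code; MagicSniffer, sniff, wrap and indexOf are hypothetical names, and the "wrapped" payload is a toy byte array, not a valid OLE2 file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper, not part of Tika: guesses a media type from leading bytes. */
final class MagicSniffer {
    // OLE2 compound-document signature (legacy Office files and OLE wrappers).
    static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // PDF files begin with the ASCII bytes "%PDF-".
    static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** Looks only at the first bytes, so a wrapper hides whatever it contains. */
    static String sniff(byte[] data) {
        if (startsWith(data, PDF)) return "application/pdf";
        if (startsWith(data, OLE2)) return "application/x-tika-msoffice";
        return "application/octet-stream";
    }

    /** Returns the index of the first occurrence of needle in haystack, or -1. */
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    /** Toy stand-in for an OLE wrapper: signature, zero padding, then the payload. */
    static byte[] wrap(byte[] payload, int padding) {
        byte[] out = new byte[OLE2.length + padding + payload.length];
        System.arraycopy(OLE2, 0, out, 0, OLE2.length);
        System.arraycopy(payload, 0, out, OLE2.length + padding, payload.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] bare = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] wrapped = wrap(bare, 16);
        System.out.println(sniff(bare));           // application/pdf
        System.out.println(sniff(wrapped));        // application/x-tika-msoffice
        System.out.println(indexOf(wrapped, PDF)); // 24
    }
}
```

Under that assumption, reaching the PDF means unwrapping the container first and running detection again on the inner stream, rather than trusting the type reported for the outer object.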

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

