tika-dev mailing list archives

From "Jeremy Anderson (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-704) PDF and Outlook docs embedded in MS Word documents not parsed
Date Thu, 01 Sep 2011 17:34:11 GMT
PDF and Outlook docs embedded in MS Word documents not parsed

                 Key: TIKA-704
                 URL: https://issues.apache.org/jira/browse/TIKA-704
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.9
         Environment: Windows 7 64-bit
            Reporter: Jeremy Anderson

Currently, embedded PDFs and Outlook .msg files contained in MS Word documents do not
appear to be parsed correctly. I'll attach a sample of each, along with my recursive
parser (in case the problem lies there).

From what I see, when these embedded objects are parsed, they're initially identified
as vnd.openxmlformats-officedocument.oleObject in the metadata's Content-Type field. After
the RecursiveParser calls its superclass's parse method, the Content-Types change to the following:

PDFs: application/vnd.ms-works
.msg files: application/x-tika-msoffice

The internal AutoDetectParser is unable to properly identify these PDFs and therefore does
not call the PDFParser on them.
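
To narrow down where detection goes wrong, it may help to look at the leading bytes of each embedded stream before it reaches the parser. The following is only a minimal sketch in plain Java and is not Tika's detection code: the MagicSniffer class and its sniff method are hypothetical names, and the only assumptions are the well-known file signatures ("%PDF-" for PDF, D0 CF 11 E0 A1 B1 1A E1 for OLE2 compound documents, the container format used by .msg files). The application/x-tika-msoffice label is reused here purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: classifies a stream by its first bytes.
final class MagicSniffer {
    // A PDF file starts with the ASCII text "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // OLE2 compound documents (the container used by .msg files) start with these 8 bytes.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns true when data begins with the given prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Maps the leading bytes to a media type string; falls back to octet-stream.
    static String sniff(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(head, OLE2_MAGIC)) {
            return "application/x-tika-msoffice"; // generic OLE2 label, for illustration
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(sample));
    }
}
```

If the bytes handed to the recursive parser begin with the OLE2 signature rather than "%PDF-", the PDF may still be wrapped in an OLE container at that point, which would be one possible (unconfirmed) explanation for the Content-Types above.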

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira