tika-dev mailing list archives

From "Daniel Bonniot de Ruisselet (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-1167) Embedded object not extracted
Date Wed, 28 Aug 2013 11:48:52 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752209#comment-13752209
] 

Daniel Bonniot de Ruisselet edited comment on TIKA-1167 at 8/28/13 11:47 AM:
-----------------------------------------------------------------------------

After further analysis, I think support for such cases probably needs to be added in POI (but
comments are welcome if someone has further insight). I posted comments and a tentative patch
to this POI bug: https://issues.apache.org/bugzilla/show_bug.cgi?id=51891

Even if that works out well, it would probably be useful to add a test at the Tika level as
well. The OLE parsing seems rather sensitive (for a reason: the format itself looks messy
and poorly documented). Also, the integration of POI and Tika seems tight, so it can only help
to test that things work at different levels.
                
      was (Author: dbr):
    After further analysis, I think support for such cases probably needs to be added in POI
(but comments are welcome if someone has further insight). I'm working on submitting an issue
and probably a tentative patch there. Will link to it here when it exists.

Even if that works out well, it would probably be useful to add a test at the Tika level as
well. The OLE parsing seems rather sensitive (for a reason: the format itself looks messy
and poorly documented). Also, the integration of POI and Tika seems tight, so it can only help
to test that things work at different levels.
                  
> Embedded object not extracted
> -----------------------------
>
>                 Key: TIKA-1167
>                 URL: https://issues.apache.org/jira/browse/TIKA-1167
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Daniel Bonniot de Ruisselet
>            Priority: Critical
>             Fix For: 1.5
>
>         Attachments: Doc w Structure that wont extract.docx
>
>
> For the attached docx, tika seems to detect the embedded object, as shown by this tag:
> {{<div class="embedded" id="rId10"/>}}
> However, extraction itself (using -z on the command line, or using the API) does not
> seem to work for this object:
> {{java -jar tika-app-1.4.jar -z Doc\ w\ Structure\ that\ wont\ extract.docx}}
> {{Extracting 'rId9_image1.wmf' (application/x-msmetafile) to /tmp/tika/rId9_image1.wmf}}
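
To compare what Tika extracts against what the package actually contains, one option is to list the parts of the docx directly, since a docx file is a ZIP package. The following is only a minimal sketch using the JDK's java.util.zip, not Tika's or POI's API. The class name DocxEmbeddedParts is hypothetical, and the two path prefixes are the conventional locations for embedded objects and media, which is an assumption: the authoritative mapping from relationship ids such as rId10 to parts lives in word/_rels/document.xml.rels.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper: lists candidate embedded parts of a docx package. */
public class DocxEmbeddedParts {

    // Conventional locations only (assumption). A producer may store parts
    // elsewhere; word/_rels/document.xml.rels is the authoritative source.
    private static final String[] PREFIXES = {"word/embeddings/", "word/media/"};

    /** Returns the names of all file entries under the prefixes above. */
    public static List<String> listEmbeddedParts(Path docx) throws IOException {
        List<String> parts = new ArrayList<>();
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // skip directory entries, keep files only
                }
                for (String prefix : PREFIXES) {
                    if (entry.getName().startsWith(prefix)) {
                        parts.add(entry.getName());
                        break;
                    }
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocxEmbeddedParts <file.docx>");
            System.exit(1);
        }
        for (String name : listEmbeddedParts(Paths.get(args[0]))) {
            System.out.println(name);
        }
    }
}
```

If this prints a part that has no counterpart in the output of the -z run, that part is a candidate for the object that is detected but not extracted.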

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
