tika-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-987) Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted
Date Sun, 02 Sep 2012 14:23:07 GMT
Michael McCandless created TIKA-987:
---------------------------------------

             Summary: Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted
                 Key: TIKA-987
                 URL: https://issues.apache.org/jira/browse/TIKA-987
             Project: Tika
          Issue Type: Bug
            Reporter: Michael McCandless
             Fix For: 1.3


I have two Word docs, both containing the same drawing, but one has
text added.

In one case (picture.doc) the extraction is correct: it contains only
an embedded image.wmf; when I view the image it's correct.

In the second case (picture_3.doc) the picture is extracted as a file
named image (no extension) that is 0 bytes long, and an invalid
character (mapped to the Unicode replacement character) is inserted
before the image:

{noformat}
<title/>
</head>
<body><p>�<img src="embedded:image1" alt="image1"/></p>
<p/>
<p/>
<p>vehicle
</p>
{noformat}

(The text "vehicle" is extracted correctly, though.)
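
As a rough aid for spotting the two symptoms above in extracted XHTML, here is a minimal sketch using only the Java standard library. It is not part of Tika or POI: the class name ExtractionCheck and both helper methods are hypothetical, and it assumes the XHTML output has already been obtained as a String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Tika API): scans extracted XHTML for the
// symptoms described in this report.
public class ExtractionCheck {
    // Matches src="embedded:NAME" attributes as they appear in the output above.
    private static final Pattern EMBEDDED_SRC =
            Pattern.compile("src=\"embedded:([^\"]+)\"");

    /** Counts U+FFFD (Unicode replacement character) occurrences. */
    static int countReplacementChars(String xhtml) {
        int count = 0;
        for (int i = 0; i < xhtml.length(); i++) {
            if (xhtml.charAt(i) == '\uFFFD') {
                count++;
            }
        }
        return count;
    }

    /** Returns embedded resource names that carry no file extension. */
    static List<String> embeddedNamesWithoutExtension(String xhtml) {
        List<String> names = new ArrayList<>();
        Matcher m = EMBEDDED_SRC.matcher(xhtml);
        while (m.find()) {
            String name = m.group(1);
            if (name.indexOf('.') < 0) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Sample line modeled on the picture_3.doc output quoted above.
        String bad = "<body><p>\uFFFD<img src=\"embedded:image1\" alt=\"image1\"/></p>";
        System.out.println(countReplacementChars(bad));          // prints 1
        System.out.println(embeddedNamesWithoutExtension(bad));  // prints [image1]
    }
}
```

A check like this only flags the visible symptoms; it says nothing about whether the embedded image itself is 0 bytes, which would have to be verified on the extracted resource.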

I dug a bit: the second doc contains an embedded {SHAPE *
MERGEFORMAT} field, on which we invoke
WordExtractor.handleSpecialCharacterRuns, and somehow that extracts
the 0-byte, no-extension image as well as the invalid character.  The
first doc has no such field (at least not one that's handled by
handleSpecialCharacterRuns...).  Beyond that I'm not sure how to
fix this... it could be that something is going wrong in how POI
parses the Pictures from PictureSource.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
