tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2354) Missing many embedded images in .doc files
Date Thu, 04 May 2017 02:21:04 GMT
Tim Allison created TIKA-2354:
---------------------------------

             Summary: Missing many embedded images in .doc files
                 Key: TIKA-2354
                 URL: https://issues.apache.org/jira/browse/TIKA-2354
             Project: Tika
          Issue Type: Bug
            Reporter: Tim Allison
            Priority: Blocker


On a slightly deeper look at the comparison results between 1.14 and trunk, it looks like
we're missing quite a few embedded images from .doc files.  I initially thought these could
be explained by different handling of macros, but that's not the issue.

I haven't traced the commit that did it (very likely my fault), but the problem shows up
when we call {{handlePictureCharacterRun}} with a null character run:
{noformat}
        // Handle any pictures that we haven't output yet
        for (Picture p = pictures.nextUnclaimed(); p != null; ) {
            handlePictureCharacterRun(
                    null, p, pictures, xhtml
            );
            p = pictures.nextUnclaimed();
        }
{noformat}

The null character run then causes the picture to be skipped in this check, because {{isRendered(cr)}}
returns false if {{cr}} is {{null}}:

{noformat}
        if (!isRendered(cr) || picture == null) {
            // Oh dear, we've run out...
            // Probably caused by multiple \u0008 images referencing
            //  the same real image
            return;
        }
{noformat}
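
To make the interaction above concrete, here is a minimal, self-contained sketch. It does not use the real Tika or POI classes: {{Run}}, {{Picture}}, and both {{outputsPicture...}} methods are hypothetical stand-ins, and the second method is only one possible way to let a null run through, not necessarily the fix that should be committed.

```java
// Hypothetical stand-ins; these are NOT the real POI/Tika classes.
final class NullRunSketch {

    /** Stand-in for a character run; it only tracks a "deleted" flag. */
    static final class Run {
        final boolean markedDeleted;
        Run(boolean markedDeleted) { this.markedDeleted = markedDeleted; }
    }

    /** Stand-in for an embedded picture. */
    static final class Picture {
        final String name;
        Picture(String name) { this.name = name; }
    }

    // Behavior described in the report: a null run counts as "not rendered".
    static boolean isRendered(Run cr) {
        return cr != null && !cr.markedDeleted;
    }

    // Mirrors the reported guard; false means the picture is skipped.
    static boolean outputsPictureAsReported(Run cr, Picture picture) {
        if (!isRendered(cr) || picture == null) {
            return false;
        }
        return true;
    }

    // One possible adjustment (an assumption): skip only when a run
    // actually exists and is not rendered, so a null run is allowed.
    static boolean outputsPictureWithNullAllowed(Run cr, Picture picture) {
        if (picture == null) {
            return false;
        }
        if (cr != null && !isRendered(cr)) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Picture unclaimed = new Picture("image1.png");
        // Unclaimed pictures are passed with a null run, as in the loop above.
        System.out.println(outputsPictureAsReported(null, unclaimed));      // prints "false"
        System.out.println(outputsPictureWithNullAllowed(null, unclaimed)); // prints "true"
    }
}
```

In the sketch, a picture attached to a run that is marked deleted is still skipped by both variants; only the null-run case differs.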



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
