tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-2807) .docx text extract leaves out rich text content-control inside of a text box
Date Mon, 07 Jan 2019 14:40:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735894#comment-16735894 ]

Tim Allison edited comment on TIKA-2807 at 1/7/19 2:39 PM:
-----------------------------------------------------------

This is the body that is extracted by the beta-level SAXDocxExtractor:

{noformat}
<body><p>simple-text</p>
<p><a name="fefweefwe" /><b>rich-text-content-control_inside-text-box</b>
<b />
<img /></p>
<p>rich-text-content-control</p>
<a name="_GoBack" /><div class="glossary"><p>Klicken oder tippen Sie hier,
um Text einzugeben.</p>
</div>
</body>
{noformat}

To select the SAX-based extractor rather than the traditional DOM parser, see: https://wiki.apache.org/tika/MSOfficeParsers
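
For reference, a minimal sketch of one way to switch the extractor on programmatically, assuming Tika 1.x with the tika-parsers module on the classpath. The class name SaxDocxExample and the file path are placeholders for illustration (the file name is taken from the attachment on this issue); the wiki page above remains the authoritative description.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;

// Hypothetical example class; not part of Tika.
public class SaxDocxExample {
    public static void main(String[] args) throws Exception {
        // Ask the OOXML parser to use the beta SAX-based .docx extractor
        // instead of the default DOM-based one.
        OfficeParserConfig officeConfig = new OfficeParserConfig();
        officeConfig.setUseSAXDocxExtractor(true);

        // Parsers look up their configuration in the ParseContext.
        ParseContext context = new ParseContext();
        context.set(OfficeParserConfig.class, officeConfig);

        // Collect the XHTML output so the <body> can be inspected.
        ToXMLContentHandler handler = new ToXMLContentHandler();

        // Assumed path: the attachment saved to the working directory.
        try (InputStream in = Files.newInputStream(Paths.get("test-document.docx"))) {
            new AutoDetectParser().parse(in, handler, new Metadata(), context);
        }
        System.out.println(handler.toString());
    }
}
```

Leaving the flag unset (or setting it to false) keeps the DOM-based extractor, so the two outputs can be compared on the same file.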

I'm going to do a bit more digging to figure out if we can extract this with the DOM parser
without major changes to POI.
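
While comparing the two extractors, a small check like the following may help confirm which of the expected strings actually show up in a given XHTML output. This is only a sketch using the JDK's built-in XML parser; the class and method names (ExtractedBodyCheck, textOf, missing) are made up for illustration, and it assumes the output is well-formed XML, as the body above is.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper; not part of Tika.
public class ExtractedBodyCheck {

    // Returns the concatenated text of all elements in a well-formed XHTML fragment.
    static String textOf(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        return doc.getDocumentElement().getTextContent();
    }

    // Returns the expected strings that do NOT occur in the extracted text.
    static List<String> missing(String xhtml, String... expected) throws Exception {
        String text = textOf(xhtml);
        List<String> notFound = new ArrayList<>();
        for (String s : expected) {
            if (!text.contains(s)) {
                notFound.add(s);
            }
        }
        return notFound;
    }

    public static void main(String[] args) throws Exception {
        // The body extracted by the SAX extractor, as quoted in this comment.
        String body = "<body><p>simple-text</p>\n"
                + "<p><a name=\"fefweefwe\" /><b>rich-text-content-control_inside-text-box</b>\n"
                + "<b />\n"
                + "<img /></p>\n"
                + "<p>rich-text-content-control</p>\n"
                + "<a name=\"_GoBack\" /><div class=\"glossary\"><p>Klicken oder tippen Sie hier,\n"
                + "um Text einzugeben.</p>\n"
                + "</div>\n"
                + "</body>";

        List<String> notFound = missing(body,
                "simple-text",
                "rich-text-content-control_inside-text-box",
                "rich-text-content-control");
        System.out.println("Missing strings: " + notFound);
    }
}
```

Feeding the DOM parser's output through the same check would list the text-box string as missing for as long as this issue is unresolved.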


was (Author: tallison@mitre.org):
This is the body that is extracted by the beta-level SAXDocxExtractor:

{noformat}
<body><p>simple-text</p>
<p><a name="fefweefwe" /><b>rich-text-content-control_inside-text-box</b>
<b />
<img /></p>
<p>rich-text-content-control</p>
<a name="_GoBack" /><div class="glossary"><p>Klicken oder tippen Sie hier,
um Text einzugeben.</p>
</div>
</body></html>
{noformat}

To select that vs the traditional DOM parser, see: https://wiki.apache.org/tika/MSOfficeParsers

I'm going to do a bit more digging to figure out if we can extract this with the DOM parser
without major changes to POI.

> .docx text extract leaves out rich text content-control inside of a text box
> ----------------------------------------------------------------------------
>
>                 Key: TIKA-2807
>                 URL: https://issues.apache.org/jira/browse/TIKA-2807
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20
>            Reporter: Claudia Mickiewicz
>            Priority: Critical
>         Attachments: test-document.docx
>
>
> When parsing a Microsoft Word .docx, a Rich Text Content Control nested inside a Text Box
> remains unextracted.
> I have attached a .docx file that can be tested against. 
>  
> "_rich-text-content-control_inside-text-box_" remains unextracted, while "rich-text-content-control"
> and "_simple text_" are extracted without any problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
