tika-dev mailing list archives

From "Claudia Mickiewicz (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2807) .docx text extract leaves out rich text content-control inside of a text box
Date Mon, 07 Jan 2019 12:43:00 GMT
Claudia Mickiewicz created TIKA-2807:
----------------------------------------

             Summary: .docx text extract leaves out rich text content-control inside of a text box
                 Key: TIKA-2807
                 URL: https://issues.apache.org/jira/browse/TIKA-2807
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.20
            Reporter: Claudia Mickiewicz
         Attachments: test-document.docx

When parsing a Microsoft Word .docx file, a Rich Text Content Control nested inside a Text Box
is not extracted.

I have attached a .docx file that can be tested against. 

 

"_rich-text-content-control_inside-text-box_" is not extracted, while "rich-text-content-control"
and "_simple text_" are extracted without any problem.
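
To help confirm which content is affected, here is a minimal sketch that lists the text of every content control nested inside a text box. It uses only the JDK and does not call Tika. It assumes the attached file stores its main body in word/document.xml and represents content controls as w:sdt elements and text box bodies as w:txbxContent elements, as WordprocessingML normally does. The class and method names (TextBoxSdtFinder, findSdtTextInTextBoxes) are hypothetical and exist only for this illustration.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical diagnostic helper, not part of Tika.
public class TextBoxSdtFinder {
    // WordprocessingML main namespace (the usual "w:" prefix).
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of every w:sdt that has a w:txbxContent ancestor.
    static List<String> findSdtTextInTextBoxes(InputStream documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(documentXml);

        List<String> found = new ArrayList<>();
        NodeList sdts = doc.getElementsByTagNameNS(W_NS, "sdt");
        for (int i = 0; i < sdts.getLength(); i++) {
            Element sdt = (Element) sdts.item(i);
            if (hasAncestor(sdt, "txbxContent")) {
                found.add(collectText(sdt));
            }
        }
        return found;
    }

    // Walks up the tree looking for a w:<localName> ancestor.
    static boolean hasAncestor(Node node, String localName) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (W_NS.equals(p.getNamespaceURI()) && localName.equals(p.getLocalName())) {
                return true;
            }
        }
        return false;
    }

    // Concatenates the contents of all w:t runs below the element.
    static String collectText(Element element) {
        StringBuilder sb = new StringBuilder();
        NodeList texts = element.getElementsByTagNameNS(W_NS, "t");
        for (int i = 0; i < texts.getLength(); i++) {
            sb.append(texts.item(i).getTextContent());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TextBoxSdtFinder <file.docx>");
            return;
        }
        // A .docx file is a ZIP archive; the main body lives in word/document.xml.
        try (ZipFile zip = new ZipFile(args[0])) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                System.err.println("word/document.xml not found");
                return;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                for (String text : findSdtTextInTextBoxes(in)) {
                    System.out.println(text);
                }
            }
        }
    }
}
```

Run against test-document.docx, any string this prints that is missing from Tika's extracted text would match the behaviour described above. Word often writes a text box twice (a DrawingML version plus a VML fallback), so the same string may be listed more than once.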



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
