tika-dev mailing list archives

From "Cristian Vat (Jira)" <j...@apache.org>
Subject [jira] [Created] (TIKA-3008) Word Doc/Docx Formatting Extraction - Superscript/Subscript
Date Wed, 11 Dec 2019 10:25:00 GMT
Cristian Vat created TIKA-3008:
----------------------------------

             Summary: Word Doc/Docx Formatting Extraction - Superscript/Subscript
                 Key: TIKA-3008
                 URL: https://issues.apache.org/jira/browse/TIKA-3008
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.23
            Reporter: Cristian Vat


Text extraction from Word .doc/.docx files doesn't handle superscript/subscript at all.

This changes the actual text extracted: character runs that differ only in sup/sub formatting
are merged together, because no tags are generated between them.

This is especially problematic for some legal documents, where getting "according
to Art 51" instead of "according to Art 5^1^" completely changes the meaning.
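
The effect can be illustrated with a minimal sketch. It does not use Tika or POI code: the Run and VertAlign types and both render methods are hypothetical names invented for this illustration, and XML escaping of the text is left out. The first method concatenates runs the way the current extraction appears to behave; the second shows one possible way to keep the distinction by emitting tags.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of character runs; not Tika or POI classes.
class SupSubSketch {
    enum VertAlign { BASELINE, SUPERSCRIPT, SUBSCRIPT }

    static final class Run {
        final String text;
        final VertAlign align;
        Run(String text, VertAlign align) { this.text = text; this.align = align; }
    }

    // Mimics the reported behaviour: alignment is ignored, so runs are simply concatenated.
    static String renderIgnoringAlign(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            sb.append(r.text);
        }
        return sb.toString();
    }

    // Wraps runs that are not on the baseline in <sup>/<sub> tags.
    static String renderWithTags(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run r : runs) {
            switch (r.align) {
                case SUPERSCRIPT:
                    sb.append("<sup>").append(r.text).append("</sup>");
                    break;
                case SUBSCRIPT:
                    sb.append("<sub>").append(r.text).append("</sub>");
                    break;
                default:
                    sb.append(r.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
            new Run("according to Art 5", VertAlign.BASELINE),
            new Run("1", VertAlign.SUPERSCRIPT));
        System.out.println(renderIgnoringAlign(runs)); // according to Art 51
        System.out.println(renderWithTags(runs));      // according to Art 5<sup>1</sup>
    }
}
```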

 

The problem seems to affect parsing of both the old Word .doc format and the OOXML .docx format.

Sub/sup can be set on the character run itself or on the document style assigned to that
character run.
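
A rough sketch of how the two sources could be combined is shown here. It assumes that a value set directly on the run takes precedence over the assigned style, and it ignores style inheritance; the resolve method, the style map, and the "FootnoteRef" style id are hypothetical and do not come from the Tika or POI code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical resolution of the effective sub/sup setting for one run.
class EffectiveAlignSketch {
    enum VertAlign { BASELINE, SUPERSCRIPT, SUBSCRIPT }

    // direct is null when the run sets nothing itself;
    // styleId is null when no style is assigned to the run.
    static VertAlign resolve(VertAlign direct, String styleId, Map<String, VertAlign> styles) {
        if (direct != null) {
            return direct; // assumption: direct run formatting wins over the style
        }
        if (styleId != null) {
            VertAlign fromStyle = styles.get(styleId);
            if (fromStyle != null) {
                return fromStyle;
            }
        }
        return VertAlign.BASELINE;
    }

    public static void main(String[] args) {
        Map<String, VertAlign> styles = new HashMap<>();
        styles.put("FootnoteRef", VertAlign.SUPERSCRIPT);

        System.out.println(resolve(null, "FootnoteRef", styles));                // SUPERSCRIPT
        System.out.println(resolve(VertAlign.SUBSCRIPT, "FootnoteRef", styles)); // SUBSCRIPT
        System.out.println(resolve(null, null, styles));                         // BASELINE
    }
}
```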

 

I'm already working on fixes and test documents, and will comment with a work-in-progress branch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
