tika-dev mailing list archives

From "Cristian Vat (Jira)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-3008) Word Doc/Docx Formatting Extraction - Superscript/Subscript
Date Wed, 11 Dec 2019 19:25:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993842#comment-16993842 ]

Cristian Vat commented on TIKA-3008:

Added parser test and sample documents to my branch.

It seems to work in most cases, and all WordParser tests still pass.

However, I have a .doc that I can't share in which some superscript text fails to get detected.

I'll keep trying a little longer to recreate the document in a form I can share, but I have
a feeling this might be an issue for POI: during parsing, the character run doesn't have the
style ID I expect it to have.

> Word Doc/Docx Formatting Extraction - Superscript/Subscript
> -----------------------------------------------------------
>                 Key: TIKA-3008
>                 URL: https://issues.apache.org/jira/browse/TIKA-3008
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.23
>            Reporter: Cristian Vat
>            Priority: Major
> Word extraction from .doc/.docx doesn't handle superscript/subscript at all.
> This changes the extracted text: character runs that differ only in sup/sub are merged, since no tags are generated between them.
> This is especially problematic for some legal documents, where getting "according to Art 51" instead of "according to Art 5^1^" completely changes the meaning.
> The problem seems to affect parsing of both the old Word .doc format and the OOXML .docx format.
> Sub/sup can be set on the character run itself or on the document style assigned to the character run.
> I'm already working on fixes and test documents, and will comment with a work-in-progress branch.
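
To illustrate the behaviour described above, here is a minimal, self-contained sketch. It is not the Tika or POI code: `SupSubSketch`, `Run`, `VerticalAlign`, the style map, and the `<sup>`/`<sub>` output tags are all hypothetical names chosen for the example. It assumes that a setting on the run itself takes precedence over the one from the run's assigned style, and that a run with neither is treated as baseline text.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical model, not the Tika/POI API: shows why sup/sub must be
// resolved per run (directly or via its style) before runs are concatenated.
class SupSubSketch {
    enum VerticalAlign { BASELINE, SUPERSCRIPT, SUBSCRIPT }

    /** A character run: text, an optional direct setting, an optional style id. */
    static final class Run {
        final String text;
        final VerticalAlign direct; // null when the run itself sets nothing
        final String styleId;       // null when no style is assigned

        Run(String text, VerticalAlign direct, String styleId) {
            this.text = text;
            this.direct = direct;
            this.styleId = styleId;
        }
    }

    /** Direct setting first (assumed precedence), then the style, else baseline. */
    static VerticalAlign effective(Run run, Map<String, VerticalAlign> styles) {
        if (run.direct != null) {
            return run.direct;
        }
        VerticalAlign fromStyle = (run.styleId == null) ? null : styles.get(run.styleId);
        return (fromStyle != null) ? fromStyle : VerticalAlign.BASELINE;
    }

    static String openTag(VerticalAlign align) {
        switch (align) {
            case SUPERSCRIPT: return "<sup>";
            case SUBSCRIPT:   return "<sub>";
            default:          return "";
        }
    }

    static String closeTag(VerticalAlign align) {
        switch (align) {
            case SUPERSCRIPT: return "</sup>";
            case SUBSCRIPT:   return "</sub>";
            default:          return "";
        }
    }

    /** Concatenates run text, emitting a tag whenever the effective setting changes. */
    static String render(List<Run> runs, Map<String, VerticalAlign> styles) {
        StringBuilder out = new StringBuilder();
        VerticalAlign open = VerticalAlign.BASELINE;
        for (Run run : runs) {
            VerticalAlign next = effective(run, styles);
            if (next != open) {
                out.append(closeTag(open)).append(openTag(next));
                open = next;
            }
            out.append(run.text);
        }
        return out.append(closeTag(open)).toString();
    }

    public static void main(String[] args) {
        Map<String, VerticalAlign> styles =
                Collections.singletonMap("SupStyle", VerticalAlign.SUPERSCRIPT);
        List<Run> runs = Arrays.asList(
                new Run("according to Art 5", null, null),
                new Run("1", null, "SupStyle")); // superscript comes from the style only
        System.out.println(render(runs, styles)); // according to Art 5<sup>1</sup>
    }
}
```

In this sketch, a run whose style id cannot be resolved falls back to baseline, so its text is merged with its neighbours and "Art 5" followed by "1" comes out as "Art 51". That is the same symptom as in the issue description, and it would also be the outcome if a run arrives without the style id one expects.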
