tika-dev mailing list archives

From "Steve Gullion (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2036) Deleted Text from Word File Shows Up in Extract
Date Sat, 16 Jul 2016 21:06:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15380946#comment-15380946 ]

Steve Gullion commented on TIKA-2036:

Experimenting with Apache POI, I find that XWPFRun.getText(), the lowest-level text extractor,
returns null if the text in a run has been deleted. So some code further up the food chain is
actually putting the deleted text back in. Just sayin'.
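
For illustration only, here is a minimal sketch of the distinction involved. It does not use POI or Tika and is not the code path either library takes; it only assumes the usual WordprocessingML layout, in which tracked deletions sit inside w:del elements and carry their text in w:delText rather than w:t. The class name DeletedTextSketch, the method visibleText, and the SAMPLE fragment (with revision attributes such as w:id and w:author left out) are all hypothetical. An extractor that skips w:del subtrees and reads only w:t would produce the output the reporter expects:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch: walks raw WordprocessingML with the JDK's DOM parser.
class DeletedTextSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Simplified stand-in for the paragraph in the report (assumed markup).
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W_NS + "\">"
        + "<w:r><w:t xml:space=\"preserve\">9. </w:t></w:r>"
        + "<w:del><w:r><w:delText xml:space=\"preserve\">This Agreement shall be "
        + "governed by and construed in accordance with </w:delText></w:r></w:del>"
        // Inserted, then deleted: a deletion nested inside an insertion.
        + "<w:ins><w:del><w:r><w:delText>Arizona</w:delText></w:r></w:del></w:ins>"
        + "<w:del><w:r><w:delText xml:space=\"preserve\"> New York law"
        + "</w:delText></w:r></w:del>"
        + "<w:ins><w:r><w:t>(Intentionally omitted.)</w:t></w:r></w:ins>"
        + "</w:p>";

    // Returns the text of all w:t elements that are not inside a w:del.
    static String visibleText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    private static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && W_NS.equals(node.getNamespaceURI())) {
            String name = node.getLocalName();
            if ("del".equals(name)) {
                return; // skip the whole tracked deletion
            }
            if ("t".equals(name)) {
                out.append(node.getTextContent());
                return;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }

    public static void main(String[] args) {
        System.out.println(visibleText(SAMPLE)); // prints "9. (Intentionally omitted.)"
    }
}
```

If the deleted text does show up in an extract, the sketch suggests that something is either reading w:delText as well or not skipping w:del; which of those applies here is not established by this comment.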

> Deleted Text from Word File Shows Up in Extract
> -----------------------------------------------
>                 Key: TIKA-2036
>                 URL: https://issues.apache.org/jira/browse/TIKA-2036
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.13
>         Environment: Windows, under TikaOnDotNet
>            Reporter: Steve Gullion
>              Labels: word
> A .docx file, with "track changes" on, includes deleted text. In this case, there are
> two overlapping deletions:
> 9.	[DELETED:This Agreement shall be governed by and construed in accordance with [INSERTED,
> THEN DELETED:Arizona] New York law] (Intentionally omitted.)
> The text should only include "9. (Intentionally omitted)". However, the output is "9.
> This Agreement shall be governed and construed in accordance with New York law." So it recognizes
> "Arizona" as deleted, but not the rest of it.
> Edit: this is worse than I originally thought. ALL deleted text is showing up in text
> exported from other Word docs. I saw this reported in 2011, and there was supposedly a patch,
> but apparently it doesn't work, or something else was changed. Is there an option somewhere
> that provides for the exclusion of deleted text generally?

This message was sent by Atlassian JIRA
