tika-dev mailing list archives

From "Steve Gullion (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-2036) Deleted Text from Word File Shows Up in Extract
Date Sat, 16 Jul 2016 15:48:20 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Gullion updated TIKA-2036:
--------------------------------
    Description: 
Text extracted from a .docx file that has "track changes" turned on includes deleted text.
In this case, there are two overlapping deletions:

9.	[DELETED:This Agreement shall be governed by and construed in accordance with [INSERTED,
THEN DELETED:Arizona] New York law] (Intentionally omitted.)

The extracted text should contain only "9. (Intentionally omitted)". However, the output is "9. This
Agreement shall be governed and construed in accordance with New York law." So the extractor
recognizes "Arizona" as deleted, but not the rest of the deleted clause.

Edit: this is worse than I originally thought. ALL deleted text is showing up in text extracted
from other Word docs as well. I saw this reported in 2011, and there was supposedly a patch, but
apparently it doesn't work, or something else has changed since. Is there an option somewhere
that excludes deleted text generally?
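
For reference, here is a minimal sketch of the behavior expected above. It is not Tika code. It relies only on how WordprocessingML stores tracked changes: live text sits in w:t elements, while tracked deletions sit in w:delText elements inside w:del. The SAMPLE fragment, the author names, and the function name visible_text are hypothetical; the actual document.xml was not attached, so its exact markup (including the w:ins/w:del nesting for "Arizona") is an assumption.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical fragment modeled on clause 9 in this report. The real
# document.xml is not available, so this markup is an assumption.
SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">9. </w:t></w:r>
  <w:del w:id="1" w:author="A">
    <w:r><w:delText xml:space="preserve">This Agreement shall be governed by and construed in accordance with </w:delText></w:r>
  </w:del>
  <w:ins w:id="2" w:author="A">
    <w:del w:id="3" w:author="B">
      <w:r><w:delText>Arizona</w:delText></w:r>
    </w:del>
  </w:ins>
  <w:del w:id="4" w:author="A">
    <w:r><w:delText xml:space="preserve"> New York law</w:delText></w:r>
  </w:del>
  <w:r><w:t>(Intentionally omitted.)</w:t></w:r>
</w:p>
"""


def visible_text(paragraph_xml: str, include_deleted: bool = False) -> str:
    """Concatenate run text in document order, skipping tracked deletions by default."""
    root = ET.fromstring(paragraph_xml)
    wanted = {f"{{{W}}}t"}
    if include_deleted:
        # w:delText carries text removed under "track changes".
        wanted.add(f"{{{W}}}delText")
    return "".join(el.text or "" for el in root.iter() if el.tag in wanted)


if __name__ == "__main__":
    print(visible_text(SAMPLE))
    print(visible_text(SAMPLE, include_deleted=True))
```

With the default setting the deleted clause is dropped no matter how deeply the deletions are nested, because only w:t elements are collected. Passing include_deleted=True gives the kind of output described in this report, with the deleted wording mixed into the extract.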

  was:
A .docx file, with "track changes" on, includes deleted text. In this case, there are two
overlapping deletions:

9.	[DELETED:This Agreement shall be governed by and construed in accordance with [INSERTED,
THEN DELETED:Arizona] New York law] (Intentionally omitted.)

The text should only include "9. (Intentionally omitted)". However, the output is "9. This
Agreement shall be governed and construed in accordance with New York law." So it recognizes
"Arizona" as deleted, but not the rest of it.


> Deleted Text from Word File Shows Up in Extract
> -----------------------------------------------
>
>                 Key: TIKA-2036
>                 URL: https://issues.apache.org/jira/browse/TIKA-2036
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.13
>         Environment: Windows, under TikaOnDotNet
>            Reporter: Steve Gullion
>              Labels: word



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
