tika-dev mailing list archives

From "Luis Filipe Nassif (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2036) Deleted Text from Word File Shows Up in Extract
Date Sat, 16 Jul 2016 00:13:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15380341#comment-15380341 ]

Luis Filipe Nassif commented on TIKA-2036:
------------------------------------------

There are use cases where deleted content is important, such as forensics and auditing. I
agree that hiding deleted text could be the default behaviour, but I think an option must
exist to turn on extraction of deleted content.

> Deleted Text from Word File Shows Up in Extract
> -----------------------------------------------
>
>                 Key: TIKA-2036
>                 URL: https://issues.apache.org/jira/browse/TIKA-2036
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.13
>         Environment: Windows, under TikaOnDotNet
>            Reporter: Steve Gullion
>              Labels: word
>
> A .docx file, with "track changes" on, includes deleted text. In this case, there are
> two overlapping deletions:
> 9. [DELETED:This Agreement shall be governed by and construed in accordance with [INSERTED,
> THEN DELETED:Arizona] New York law] (Intentionally omitted.)
> The text should only include "9. (Intentionally omitted)". However, the output is "9.
> This Agreement shall be governed and construed in accordance with New York law." So it recognizes
> "Arizona" as deleted, but not the rest of it.
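
To make the "default off, optional on" idea concrete, here is a minimal Python sketch. It is
not Tika's code and does not claim to explain the bug. It assumes the tracked deletions are
stored the usual WordprocessingML way (w:del wrapping runs whose text sits in w:delText,
w:ins wrapping inserted runs). The function name extract_text, the include_deleted flag, and
the SAMPLE paragraph are hypothetical, hand-written for illustration from the report above.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical hand-written paragraph modelled on the report: two deletions,
# one of which wraps text that was inserted and then deleted.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:t>9. </w:t></w:r>"
    "<w:del><w:r><w:delText>This Agreement shall be governed by and "
    "construed in accordance with </w:delText></w:r></w:del>"
    "<w:ins><w:del><w:r><w:delText>Arizona </w:delText></w:r></w:del></w:ins>"
    "<w:del><w:r><w:delText>New York law </w:delText></w:r></w:del>"
    "<w:ins><w:r><w:t>(Intentionally omitted.)</w:t></w:r></w:ins>"
    "</w:p>"
)


def extract_text(paragraph_xml, include_deleted=False):
    """Return the paragraph text; deleted text is kept only on request."""
    root = ET.fromstring(paragraph_xml)
    parts = []

    def walk(node, in_deletion):
        # Everything nested under w:del counts as deleted, however deep.
        if node.tag == f"{{{W}}}del":
            in_deletion = True
        is_live_text = node.tag == f"{{{W}}}t"
        is_deleted_text = node.tag == f"{{{W}}}delText"
        if is_live_text and (include_deleted or not in_deletion):
            parts.append(node.text or "")
        elif is_deleted_text and include_deleted:
            parts.append(node.text or "")
        for child in node:
            walk(child, in_deletion)

    walk(root, False)
    return "".join(parts)


print(extract_text(SAMPLE))
# prints: 9. (Intentionally omitted.)
print(extract_text(SAMPLE, include_deleted=True))
```

With the flag left at its default, only the surviving text comes back. Passing
include_deleted=True returns the deleted wording as well, which is what the forensic and
auditing cases would need.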



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
