tika-dev mailing list archives

From "Curt Arnold (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-207) MS word doc containing tracked changes produces incorrect text
Date Thu, 01 Sep 2011 19:10:10 GMT

     [ https://issues.apache.org/jira/browse/TIKA-207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Curt Arnold updated TIKA-207:
-----------------------------

    Attachment: TIKA-207.patch

Refined fix to suppress deleted text in .doc files. Will follow up with test cases later.

POI's API for .docx does not have the equivalent of CharacterRun.isMarkedDeleted. Will likely
file a POI bug and a new TIKA bug to accomplish the equivalent for .docx files.
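
To illustrate the idea behind the fix, here is a minimal, self-contained sketch. It does not
use POI and is not the code in TIKA-207.patch. The Run class and both method names are
hypothetical stand-ins; the markedDeleted field only models the kind of flag that
CharacterRun.isMarkedDeleted reports for a run in a .doc file. The sample data reproduces
the "Field.Index.TOKENIZEDANALYZED" case from the issue description.

```java
import java.util.Arrays;
import java.util.List;

public class TrackedChangesSketch {

    /** Hypothetical model of a text run carrying a tracked-deletion flag (not a POI class). */
    public static final class Run {
        public final String text;
        public final boolean markedDeleted;

        public Run(String text, boolean markedDeleted) {
            this.text = text;
            this.markedDeleted = markedDeleted;
        }
    }

    /** Concatenates every run, deleted or not; this mirrors the behavior reported in the issue. */
    public static String extractAllText(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run run : runs) {
            sb.append(run.text);
        }
        return sb.toString();
    }

    /** Skips runs flagged as deleted, so only the current text of the document remains. */
    public static String extractCurrentText(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run run : runs) {
            if (!run.markedDeleted) {
                sb.append(run.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // TOKENIZED was replaced by ANALYZED with track changes enabled.
        List<Run> runs = Arrays.asList(
                new Run("Field.Index.", false),
                new Run("TOKENIZED", true),
                new Run("ANALYZED", false));

        System.out.println(extractAllText(runs));     // Field.Index.TOKENIZEDANALYZED
        System.out.println(extractCurrentText(runs)); // Field.Index.ANALYZED
    }
}
```

In a real parser the flag would presumably be read from each run while walking the document,
which is why the missing .docx equivalent mentioned above blocks the same approach there.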

> MS word doc containing tracked changes produces incorrect text
> --------------------------------------------------------------
>
>                 Key: TIKA-207
>                 URL: https://issues.apache.org/jira/browse/TIKA-207
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.3
>         Environment: tika-0.3-standalone.jar
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: TIKA-207.patch
>
>
> Spinoff from this discussion:
>   http://n2.nabble.com/getting-text-from-MS-Word-docs-with-tracked-changes...-td2463811.html
> When extracting text from an MS Word doc (2003 format) that has
> unapproved pending changes, the text from both old and new is glommed
> together.
> E.g., I had a doc that contained text "Field.Index.TOKENIZED", and I
> changed TOKENIZED to ANALYZED with track changes enabled, and
> then when I extract text (using TikaCLI) it produces this:
>   Field.Index.TOKENIZEDANALYZED
> So, first, it'd be nice to at least get whitespace inserted between
> old & new text.
> And, second, it'd be great to have an option to control whether it's
> old or new text that's indexed (or at least an option to only see
> "new" text, ie the current document).
> From the discussion above, it seems like POI may expose the
> fine-grained APIs to allow Tika to do this; it's just that Tika's not
> leveraging these APIs for MS Word docs.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
