tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file
Date Tue, 12 Nov 2013 11:07:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820013#comment-13820013 ]

Nick Burch commented on TIKA-1194:

I've had a quick look, and WordExtractor from Apache POI skips the text too.

My first hunch would be that it's something to do with text fields.

Any chance you could step through the parser in a debugger, checking the text of the ranges
around the point of the missing text, and see if there's anything odd going on?
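
To narrow down where the text goes missing before opening a debugger, something like the following might help. It is only a sketch, not part of Tika or POI: the class RangeDiff and its methods are hypothetical names, and it assumes you have already collected the per-range (paragraph or cell) texts as plain strings, for example from the DOCX re-save that extracts correctly. It also assumes that DOC cell text may end in the \u0007 cell mark and strips it before comparing. It reports which ranges never show up in the extracted output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds which source ranges are absent from extracted text.
class RangeDiff {

    // Collapse whitespace and cell-mark runs so layout differences are ignored.
    public static String normalize(String s) {
        return s.replaceAll("[\\s\\u0007]+", " ").trim();
    }

    // Returns the indexes of non-empty ranges whose normalized text
    // does not occur anywhere in the normalized extracted text.
    public static List<Integer> missingRanges(List<String> rangeTexts, String extracted) {
        String haystack = normalize(extracted);
        List<Integer> missing = new ArrayList<>();
        for (int i = 0; i < rangeTexts.size(); i++) {
            String needle = normalize(rangeTexts.get(i));
            if (!needle.isEmpty() && !haystack.contains(needle)) {
                missing.add(i);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from the attached document.
        List<String> ranges = List.of("Header row\r", "Cell A\u0007", "Cell B\u0007", "Footer\r");
        String extracted = "Header row\nCell A\nFooter\n";
        for (int i : missingRanges(ranges, extracted)) {
            System.out.println("range " + i + " missing: " + normalize(ranges.get(i)));
        }
    }
}
```

The indexes it reports give a starting point for a breakpoint: the ranges just before and after a missing index are the ones worth inspecting. Substring matching is deliberately crude, so short or repeated texts can hide a real gap.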

> Missing text from MS Word (DOC) file
> ------------------------------------
>                 Key: TIKA-1194
>                 URL: https://issues.apache.org/jira/browse/TIKA-1194
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Tomas Safarik
>            Priority: Critical
>         Attachments: OP-06-015.doc
> Hello,
> we noticed that the filtered text from some MS Word DOC files is missing one line (in a
> table cell) that is present in the original document.
> - If you add or remove one character anywhere before the problematic line/cell, the
> filtered text is correct. If you change the text back to the original, the filtering
> problem returns.
> - If the file is re-saved as DOCX, filtering works fine.
> I will provide a sample document. Please let me know if more information is needed.
> Regards,
> Tomas

This message was sent by Atlassian JIRA
