tika-dev mailing list archives

From "Pascal Essiembre (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2352) Incorrect EOF exception in WordPerfect parser
Date Thu, 04 May 2017 18:51:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997216#comment-15997216
] 

Pascal Essiembre commented on TIKA-2352:
----------------------------------------

I had time to look further at one of the files in the lists: "govdocs1\318\318891.wp".  It puzzles
me and I feel I must be missing something obvious.  LibreOffice opens it fine.

It is read just fine until the last page, where there is an isolated "1" in the middle of the
page.  The sequence of interest is "31 02 02 DA D0 04 D0", which can be broken down as follows:

31 - The number "1"
02 - Control character indicating to print a page number
02 - Control character indicating to print a page number
DA - Variable-length function (218) for a "box group"
D0 - Subfunction code 208.  INVALID, possible values range from 0 to 6.
04 D0 - Function length 53252 (two bytes, little-endian: 0xD004).  INVALID, greater than what's left.
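
The breakdown above can be reproduced with a short Python sketch.  This is only an
illustration, not the Tika parser code: the function name decode_function_header, the
"remaining" parameter, and the constant name are hypothetical, and the 0 to 6 subfunction
range is simply the one quoted in the breakdown.

```python
import struct

# Bytes of interest from the last page of 318891.wp, as listed above.
SEQUENCE = bytes.fromhex("31 02 02 DA D0 04 D0")

# Assumption taken from the breakdown above: valid subfunction codes
# for function 0xDA ("box group") range from 0 to 6.
MAX_BOX_GROUP_SUBFUNCTION = 6


def decode_function_header(data, offset, remaining):
    """Decode function code, subfunction code and 2-byte length at offset.

    `remaining` is the number of bytes left in the document after the
    header (hypothetical parameter; the real value depends on the file).
    Returns the three decoded values plus a list of detected problems.
    """
    # "<BBH": two unsigned bytes followed by a little-endian 16-bit length.
    function, subfunction, length = struct.unpack_from("<BBH", data, offset)
    problems = []
    if subfunction > MAX_BOX_GROUP_SUBFUNCTION:
        problems.append("subfunction out of range")
    if length > remaining:
        problems.append("length exceeds remaining bytes")
    return function, subfunction, length, problems
```

Calling decode_function_header(SEQUENCE, 3, remaining) with any "remaining" smaller than
53252 reports both problems for the DA D0 04 D0 header.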

So I do not know why this invalid function code is there, or how LibreOffice manages to
interpret it.  The 0x02 bytes may also be what throws things off, since this is the only place
those characters are found in the document and parsing goes wrong right after them.
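
One way to check where a given control byte occurs is a raw scan of the file contents.
The following Python sketch assumes the file has already been read into a bytes object;
the helper name find_byte_offsets is hypothetical.  A raw scan like this would also match
0x02 bytes that happen to sit inside function payloads, so each hit still has to be read
in context.

```python
def find_byte_offsets(data, value):
    """Return every offset in data at which the byte equals value."""
    return [offset for offset, byte in enumerate(data) if byte == value]


# The sequence of interest from the last page, used here as sample input.
sample = bytes.fromhex("31 02 02 DA D0 04 D0")
stx_offsets = find_byte_offsets(sample, 0x02)
```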

In other contexts (non-WP docs), the ASCII standard defines 0x02 as "STX -> Start of Text ->
First character of message text, and may be used to terminate the message heading".

Since there is a page number in the middle, it could be that the page/document ends there
and a new one is appended?  If so, I am not sure how 0x02 should be treated in relation to
that.

> Incorrect EOF exception in WordPerfect parser
> ---------------------------------------------
>
>                 Key: TIKA-2352
>                 URL: https://issues.apache.org/jira/browse/TIKA-2352
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Trivial
>             Fix For: 2.0, 1.15
>
>         Attachments: 462321.wp, reports.zip
>
>
> We have a few EOF exceptions in WordPerfect files that are likely not truncated.  The
> example I'll attach shortly is able to be opened without complaint by LibreOffice.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
