tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1092) Parsing of old Word file causes a TikaException
Date Tue, 12 Mar 2013 13:03:13 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600006#comment-13600006 ]

Nick Burch commented on TIKA-1092:

I'm not sure that your problem file is actually a Word document. The exception you're seeing
is triggered by POI trying to open the file and discovering that it isn't an OLE2
document. POI can't handle very old Office documents (roughly pre-95, though it varies between
formats), but it can at least open their outer OLE2 container.

Without the sample file I can't tell what your file actually is, but my best guess is that
someone has renamed it to .doc when it isn't anything of the sort.
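
One quick way to check this without the sample file changing hands is to look at the first
8 bytes of the file. What follows is only a minimal sketch, not how POI or Tika do it
internally: the class name Ole2Check and its methods are made up for illustration. It relies
on the fact that every OLE2 container starts with the bytes D0 CF 11 E0 A1 B1 1A E1. If the
value in the stack trace is a little-endian long (an assumption on my part), the file in
question starts with DB A5 2D 00 1F 40 10 04 instead.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, for illustration only; not part of POI or Tika.
final class Ole2Check {
    // The 8-byte signature found at the start of every OLE2 compound document.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if the given bytes begin with the OLE2 signature.
    static boolean hasOle2Signature(byte[] header) {
        return header.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(header, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Reads the first 8 bytes of a file and checks them (needs Java 9+ for readNBytes).
    static boolean isOle2(Path file) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        try (InputStream in = Files.newInputStream(file)) {
            int read = in.readNBytes(header, 0, header.length);
            return read == header.length && hasOle2Signature(header);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        System.out.println(file + (isOle2(file) ? " looks like" : " does not look like")
            + " an OLE2 container");
    }
}
```

A file that fails this check can never be opened by the OLE2-based parsers, whatever its
extension says.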
> Parsing of old Word file causes a TikaException
> -----------------------------------------------
>                 Key: TIKA-1092
>                 URL: https://issues.apache.org/jira/browse/TIKA-1092
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Giuseppe Totaro
>            Priority: Minor
>              Labels: office, parse, word-exception
> I found an issue with the parse method of org.apache.tika.parser.microsoft.OfficeParser.
> This parser throws a TikaException when it tries to parse very old Microsoft Word files.
> I think this issue is not a priority, because the files that cause the exception belong
> to an obsolete format/structure that even new Microsoft Office versions don't support,
> but it's important to know that something can go wrong with these outdated types.
> Here are two links about the old types (from the Microsoft support perspective):
> http://support.microsoft.com/?kbid=922850
> http://support.microsoft.com/kb/922849/it
> For example, the TikaException message is shown below:
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@789ab21d
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
> 	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
> 	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
> Caused by: java.io.IOException: Invalid header signature; read 0x0410401F002DA5DB, expected
> 	at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
> 	at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:115)
> 	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:198)
> 	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:184)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:156)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	... 5 more

