tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-705) Valid OOXML PPT file hits InvalidFormatException thrown in POI
Date Mon, 05 Sep 2011 21:09:09 GMT

    [ https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097319#comment-13097319 ]

Nick Burch commented on TIKA-705:
---------------------------------

I'll need to read the spec to be sure, but I have a feeling it could be our issue with not
removing anchors before fetching parts.

Either way, we probably want to make it easier for people to get related parts, as the
current method is a bit more fiddly than we really want.

Most of this will probably be done on the POI side, though; the only Tika change would be
moving to the new, simpler code once it is available.
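
To illustrate the suspected anchor issue: '#' is not a pchar, so a relationship target
that still carries its '#anchor' suffix would presumably fail the part name check shown
in the stack trace below. A minimal sketch of stripping the anchor first, using only
java.net.URI. The AnchorStrip class, the stripFragment helper and the example target are
hypothetical, and this is not the actual POI or Tika code:

```java
import java.net.URI;

class AnchorStrip {
    // Hypothetical helper: return the target without its "#anchor" part.
    // '#' only ever appears in a valid URI as the fragment delimiter,
    // so cutting at the first '#' is enough.
    static URI stripFragment(URI target) {
        String s = target.toString();
        int hash = s.indexOf('#');
        return hash < 0 ? target : URI.create(s.substring(0, hash));
    }

    public static void main(String[] args) {
        // Assumed example of a relationship target that carries an anchor
        URI target = URI.create("/ppt/slides/slide1.xml#_top");
        System.out.println(stripFragment(target)); // prints /ppt/slides/slide1.xml
    }
}
```

If the anchor really is the cause, the stripped URI is what would then be handed to
PackagingURIHelper.createPartName in place of the raw target.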

> Valid OOXML PPT file hits InvalidFormatException thrown in POI
> --------------------------------------------------------------
>
>                 Key: TIKA-705
>                 URL: https://issues.apache.org/jira/browse/TIKA-705
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: testPPT_various.pptx
>
>
> I took the "testRTFVarious.rtf" test case from TIKA-683, and saved it as various other
> doc types, to generate more test cases.
> But when I did this for PPTX, the resulting file hits this exception:
> {noformat}
> Exception in thread "main" org.apache.tika.exception.TikaException: Broken OOXML file
> 	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:141)
> 	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
> 	at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:95)
> 	at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:70)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
> 	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:126)
> 	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:363)
> 	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:97)
> Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: A segment shall not hold any characters other than pchar characters. [M1.6]
> 	at org.apache.poi.openxml4j.opc.PackagePartName.checkPCharCompliance(PackagePartName.java:370)
> 	at org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfPartNameHaveInvalidSegments(PackagePartName.java:270)
> 	at org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfInvalidPartUri(PackagePartName.java:185)
> 	at org.apache.poi.openxml4j.opc.PackagePartName.<init>(PackagePartName.java:83)
> 	at org.apache.poi.openxml4j.opc.PackagingURIHelper.createPartName(PackagingURIHelper.java:490)
> 	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:124)
> 	... 9 more
> {noformat}
> All I did was open Office 2007, copy/paste over the text from the Word doc, and save
> it.  Ie, it should be a valid OOXML file, unless Office 2007 is buggy?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
