tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (TIKA-698) "Invalid UTF-16 surrogate detected:" parsing PowerPoint 97-2003
Date Fri, 02 Sep 2011 17:47:09 GMT

     [ https://issues.apache.org/jira/browse/TIKA-698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-698.
--------------------------------

    Resolution: Fixed
      Assignee: Jukka Zitting

Thanks for reporting this! Fixed in revision 1164655.

> "Invalid UTF-16 surrogate detected:" parsing PowerPoint 97-2003
> ---------------------------------------------------------------
>
>                 Key: TIKA-698
>                 URL: https://issues.apache.org/jira/browse/TIKA-698
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.1-incubating, 0.9
>            Reporter: Pablo Queixalos
>            Assignee: Jukka Zitting
>             Fix For: 1.0
>
>         Attachments: MS8.ppt
>
>
> Exception when parsing this MS PowerPoint file: http://jeanferrette.free.fr/MS8.ppt
> java.io.IOException: Invalid UTF-16 surrogate detected: db00 bfff ?
>                 at com.sun.org.apache.xml.internal.serializer.ToStream.endElement(ToStream.java:2060)
>                 at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endElement(TransformerHandlerImpl.java:273)
>                 at org.apache.tika.sax.TeeContentHandler.endElement(TeeContentHandler.java:94)
>                 at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>                 at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:215)
>                 at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>                 at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>                 at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>                 at org.apache.tika.sax.XHTMLContentHandler.lazyEndHead(XHTMLContentHandler.java:169)
>                 at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:234)
>                 at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:271)
>                 at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:308)
>                 at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:41)
>                 at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
>                 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>                 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>                 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>                 [...]
> Parsing this file works fine with Tika 0.4.
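
For context on the exception quoted above: db00 is a UTF-16 high surrogate, but bfff lies outside the low-surrogate range (DC00 to DFFF), so the two code units do not form a valid pair and the XML serializer refuses to write them. The sketch below is only an illustration of how such unpaired surrogates could be detected and replaced before text reaches a SAX handler. It is not the change made in revision 1164655, and the class and method names (SurrogateCheck, firstUnpairedSurrogate, replaceUnpairedSurrogates) are hypothetical, not part of Tika.

```java
public class SurrogateCheck {

    /** Returns the index of the first unpaired surrogate in s, or -1 if there is none. */
    static int firstUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip the low half
                } else {
                    return i; // high surrogate without a following low surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate without a preceding high surrogate
            }
        }
        return -1;
    }

    /** Replaces every unpaired surrogate with U+FFFD and leaves valid pairs untouched. */
    static String replaceUnpairedSurrogates(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                out.append(c).append(s.charAt(i + 1));
                i++; // consumed both halves of the pair
            } else if (Character.isSurrogate(c)) {
                out.append('\uFFFD'); // Unicode replacement character
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Same code units as in the reported exception: db00 followed by bfff.
        String bad = "slide \uDB00\uBFFF text";
        System.out.println(firstUnpairedSurrogate(bad));
        System.out.println(replaceUnpairedSurrogates(bad));
    }
}
```

Replacing with U+FFFD is one possible policy; dropping the offending code unit or substituting a space would be equally easy to express with the same loop.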

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
