tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2179) WordMLParser fails to parse a word xml file
Date Mon, 28 Nov 2016 17:19:59 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702546#comment-15702546
] 

Hudson commented on TIKA-2179:
------------------------------

FAILURE: Integrated in Jenkins build tika-2.x-windows #79 (See [https://builds.apache.org/job/tika-2.x-windows/79/])
TIKA-2179  --  add detection and parsing for word2006ml files -- this (tallison: rev 1bb7c33846203900c1ec791c7a2a958912da2a9c)
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> WordMLParser fails to parse a word xml file
> -------------------------------------------
>
>                 Key: TIKA-2179
>                 URL: https://issues.apache.org/jira/browse/TIKA-2179
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.14
>         Environment: OSX, java 8
>            Reporter: Sean Story
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 2.0, 1.15
>
>         Attachments: File5.xml
>
>
> h3. Problem
> I have a sample Word XML file (attached as File5.xml) that can be parsed neither by OOXMLParser
> (which throws an exception {{Caused by: org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException:
> The supplied data appears to be a raw XML file. Formats such as Office 2003 XML are not supported}})
> nor by OfficeParser (which throws an exception like {{org.apache.poi.poifs.filesystem.NotOLE2FileException:
> The supplied data appears to be a raw XML file. Formats such as Office 2003 XML are not supported}}).
> I found TIKA-1958, which mentions the new WordMLParser, so I downloaded the source, built it,
> and updated my Tika version to 1.14. However, when parsing with WordMLParser, the output text
> content I get is the empty string {{""}}, whereas I expect something more like:
> {noformat}
> It means that the guy that you are trading with was reported for a scam attempt. As the others mentioned, some of these BOFA could be false.
> What's important is the current trade that you are doing.
> If everything seems to be in order then there is nothing wrong with going through with the trade.
> Auti, Sneha (QAPM)
> {noformat}
> h3. Replication
> You can replicate this with the Spock test below:
> {noformat}
>     def "display error with WordMLParser"(){
>         setup:
>         File input = new File("/Users/sstory/Downloads/File5.xml") // modify for your path
>         Parser parser = new WordMLParser()
>         //Parser parser = new OOXMLParser()
>         //Parser parser = new OfficeParser()
>         org.xml.sax.ContentHandler textHandler = new BodyContentHandler(-1)
>         Metadata metadata = new Metadata()
>         ParseContext context = new ParseContext()
>         
>         when:
>         parser.parse(input.newInputStream(), textHandler, metadata, context)
>         String result = textHandler.toString()
>         then:
>         !result.isEmpty()
>         result.contains("the guy that you are trading with")
>         result.contains("BOFA")
>     }
> {noformat}
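
For anyone triaging a similar file, it can help to check which Word XML flavor it is before picking a parser. The commit above adds detection for word2006ml, which suggests the attachment may be in that format rather than Word 2003 XML, though the message does not say so explicitly. The following is a minimal sketch using only the JDK StAX API; the class name {{WordXmlFlavor}}, the method names, and the returned labels are hypothetical and are not part of Tika. It assumes the two flavors can be told apart by the namespace of the root element.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not a Tika class: labels a stream by its root element namespace.
final class WordXmlFlavor {
    // Root namespace of Word 2003 XML documents (w:wordDocument).
    static final String WORD_2003_NS = "http://schemas.microsoft.com/office/word/2003/wordml";
    // Root namespace of Flat OPC packages (pkg:package), as written by Word 2007 and later.
    static final String FLAT_OPC_NS = "http://schemas.microsoft.com/office/2006/xmlPackage";

    private WordXmlFlavor() {}

    // Returns "word2003ml", "word2006ml", or "unknown" (also for input that is not well-formed XML).
    static String flavor(InputStream in) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Do not load DTDs or external entities from untrusted input.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                // Skip the XML declaration and processing instructions up to the root element.
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String ns = reader.getNamespaceURI();
                        if (WORD_2003_NS.equals(ns)) return "word2003ml";
                        if (FLAT_OPC_NS.equals(ns)) return "word2006ml";
                        return "unknown";
                    }
                }
                return "unknown";
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            return "unknown";
        }
    }
}
```

Calling {{WordXmlFlavor.flavor(new FileInputStream("File5.xml"))}} would then indicate whether a parser for Word 2003 XML is even the right tool for the file.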



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
