tika-dev mailing list archives

From "Sean Story (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2179) WordMLParser fails to parse a word xml file
Date Tue, 15 Nov 2016 19:12:59 GMT
Sean Story created TIKA-2179:
--------------------------------

             Summary: WordMLParser fails to parse a word xml file
                 Key: TIKA-2179
                 URL: https://issues.apache.org/jira/browse/TIKA-2179
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.14
         Environment: OS X, Java 8
            Reporter: Sean Story
            Priority: Minor


h3. Problem
I have a sample word.xml file that neither OOXMLParser nor OfficeParser can parse. OOXMLParser
yields an exception that was {{Caused by: org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException:
The supplied data appears to be a raw XML file. Formats such as Office 2003 XML are not supported}},
and OfficeParser yields an exception like {{org.apache.poi.poifs.filesystem.NotOLE2FileException:
The supplied data appears to be a raw XML file. Formats such as Office 2003 XML are not supported}}.

I found TIKA-1958, which mentions the new WordMLParser, so I downloaded the source, built it, and
updated my Tika version to 1.14. However, when parsing with WordMLParser, the output text
content I get is the empty string {{""}}, but I expect something more like:
{noformat}
It means that the guy that you are trading with was reported for a scam attempt. As the others
mentioned, some of these BOFA could be false.
What's important is the current trade that you are doing.
If everything seems to be in order then there is nothing wrong with going through with the
trade.
Auti, Sneha (QAPM)
{noformat}
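
For triage, it may help to record which XML flavor the sample actually is, since several Word formats are raw XML. The sketch below uses only the JDK's StAX API (no Tika or POI calls) to print the root element of a file. The class name {{RootElementProbe}} and the inline sample document are hypothetical and exist only for illustration. As a point of comparison, Word 2003 XML uses the root element {{wordDocument}} in the namespace {{http://schemas.microsoft.com/office/word/2003/wordml}}, while the flat single-file XML written by Word 2007 and later uses {{package}} in {{http://schemas.microsoft.com/office/2006/xmlPackage}}. Which of these the attached file is has not been checked here.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical diagnostic helper, not part of Tika or POI.
public class RootElementProbe {

    /** Returns the qualified name of the first (root) element in the stream. */
    public static QName rootElement(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not load DTDs or external entities from an untrusted sample file.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        try {
            // Skip the prolog (comments, processing instructions) up to the root element.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getName();
                }
            }
            throw new XMLStreamException("no root element found");
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Illustrative stand-in used when no file path is given on the command line.
        String sample = "<?xml version=\"1.0\"?>"
                + "<w:wordDocument xmlns:w=\"http://schemas.microsoft.com/office/word/2003/wordml\"/>";
        try (InputStream in = args.length > 0
                ? Files.newInputStream(Paths.get(args[0]))
                : new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8))) {
            QName root = rootElement(in);
            System.out.println("namespace: " + root.getNamespaceURI());
            System.out.println("root element: " + root.getLocalPart());
        }
    }
}
```

Running it as {{java RootElementProbe word.xml}} prints the namespace and local name of the root element, which can be pasted into the issue alongside the sample file.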



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)