tika-dev mailing list archives

From "Sara Miller (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2177) microsoft.OfficeParser shows add links in additional paragraphs
Date Mon, 14 Nov 2016 15:20:58 GMT
Sara Miller created TIKA-2177:

             Summary: microsoft.OfficeParser shows add links in additional paragraphs
                 Key: TIKA-2177
                 URL: https://issues.apache.org/jira/browse/TIKA-2177
             Project: Tika
          Issue Type: Bug
          Components: server
    Affects Versions: 1.13
         Environment: org.apache.tika.parser.microsoft.OfficeParser and org.apache.tika.parser.microsoft.ooxml.OOXMLParser
            Reporter: Sara Miller
            Priority: Minor

I'm converting Excel files, both .xls and .xlsx.
.xls files are handled by org.apache.tika.parser.microsoft.OfficeParser and
.xlsx files by org.apache.tika.parser.microsoft.ooxml.OOXMLParser.

If the Excel document contains a link, for example santa@gmail.com, the .xls parser wraps it in
an additional element, so the output no longer reflects the actual structure of the document.

For example, this table in file.xls: 
mailadress	password
santa@gmail.com	hohoho

will output: 
 <div class="page">
            <div class="outside">
                <a href="mailto:santa@gmail.com">santa@gmail.com</a>

The <div class="outside"> wrapper should be removed because it does not correspond to anything in the document.
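
Until the parser is fixed, the extra wrapper could be stripped from the XHTML after conversion. The following is only a sketch of that workaround, not part of Tika: the function name unwrap_outside_divs and the closed, well-formed sample fragment are assumptions made for illustration (the output quoted above is truncated), and it relies solely on Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET


def _is_outside_div(element):
    """True for <div class="outside">, with or without an XML namespace."""
    local_name = element.tag.rsplit("}", 1)[-1]
    return local_name == "div" and element.get("class") == "outside"


def _append_text(parent, index, text):
    """Attach text immediately before position `index` inside `parent`."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text


def _unwrap(parent, index):
    """Replace the child at `index` with its own text and children."""
    wrapper = parent[index]
    children = list(wrapper)
    del parent[index]
    _append_text(parent, index, wrapper.text)
    for offset, child in enumerate(children):
        parent.insert(index + offset, child)
    _append_text(parent, index + len(children), wrapper.tail)


def unwrap_outside_divs(xhtml):
    """Return `xhtml` with every nested <div class="outside"> unwrapped.

    Hypothetical helper: the root element itself is never unwrapped,
    because it has no parent to receive its children.
    """
    root = ET.fromstring(xhtml)
    while True:
        # Find one wrapper at a time so the tree is never changed mid-iteration.
        target = next(
            (
                (parent, index)
                for parent in root.iter()
                for index, child in enumerate(parent)
                if _is_outside_div(child)
            ),
            None,
        )
        if target is None:
            break
        _unwrap(*target)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed well-formed version of the fragment from this report.
    sample = (
        '<div class="page">'
        '<div class="outside">'
        '<a href="mailto:santa@gmail.com">santa@gmail.com</a>'
        "</div>"
        "</div>"
    )
    print(unwrap_outside_divs(sample))
```

With the sample fragment, the link ends up as a direct child of <div class="page">, which is the structure this report expects from the parser. The real fix belongs in the parser itself; this only masks the symptom on the consumer side.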

