tika-dev mailing list archives

From "Sara Miller (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2177) microsoft.OfficeParser shows add links in additional paragraphs
Date Mon, 14 Nov 2016 15:20:58 GMT
Sara Miller created TIKA-2177:
---------------------------------

             Summary: microsoft.OfficeParser shows add links in additional paragraphs
                 Key: TIKA-2177
                 URL: https://issues.apache.org/jira/browse/TIKA-2177
             Project: Tika
          Issue Type: Bug
          Components: server
    Affects Versions: 1.13
         Environment: org.apache.tika.parser.microsoft.OfficeParser and org.apache.tika.parser.microsoft.ooxml.OOXMLParser
            Reporter: Sara Miller
            Priority: Minor


I'm converting Excel files, both .xls and .xlsx.
.xls uses org.apache.tika.parser.microsoft.OfficeParser and
.xlsx uses org.apache.tika.parser.microsoft.ooxml.OOXMLParser.

If the Excel document contains a link, for example santa@gmail.com, the .xls parser adds extra
elements to the output that are not part of the document structure, so the output misrepresents how the document looks.


For example, this table in file.xls: 
mailadress	password
santa@gmail.com	hohoho

will output: 
        <div class="page">
            <h1>Sheet1</h1>
            <table>
                <tbody>
                    <tr>
                        <td>mailadress</td>
                        <td>password</td>
                    </tr>
                    <tr>
                        <td>santa@gmail.com</td>
                        <td>hohoho</td>
                    </tr>
                </tbody>
            </table>
            <div class="outside">
                <a href="mailto:santa@gmail.com">santa@gmail.com</a>
            </div>
        </div>

The <div class="outside"> should be removed because it does not correspond to the document
structure. 
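
Until the parser itself is changed, one possible stopgap is to post-process the XHTML and drop the extra element. The following is a minimal sketch, not part of Tika: the class name OutsideDivFilter and the method stripOutsideDivs are hypothetical, it uses only the JDK's built-in XML APIs, and it assumes the output is well-formed XML with no DOCTYPE to resolve.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not a Tika class): removes <div class="outside"> elements.
public class OutsideDivFilter {

    /** Returns the given XHTML with every div whose class is exactly "outside" removed. */
    public static String stripOutsideDivs(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // Collect matches first: the NodeList is live and would shift during removal.
        NodeList divs = doc.getElementsByTagName("div");
        List<Element> toRemove = new ArrayList<>();
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if ("outside".equals(div.getAttribute("class"))) {
                toRemove.add(div);
            }
        }
        for (Element div : toRemove) {
            div.getParentNode().removeChild(div);
        }

        // Serialize the modified DOM back to a string without an XML declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

This only hides the symptom on the consumer side and discards the link target (the mailto: href) entirely; the cell text in the table is left untouched.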



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
