tika-dev mailing list archives

From "Joachim Zittmayr (JIRA)" <j...@apache.org>
Subject [jira] Created: (TIKA-268) HTMLParser omits necessary space-characters when parsing table-data
Date Tue, 18 Aug 2009 08:59:14 GMT
HTMLParser omits necessary space-characters when parsing table-data
-------------------------------------------------------------------

                 Key: TIKA-268
                 URL: https://issues.apache.org/jira/browse/TIKA-268
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4, 0.5
         Environment: Win, Mac, Lin; Java 5+
            Reporter: Joachim Zittmayr
            Priority: Critical


When an HTML file containing a table is parsed with Tika, the HTML parser does not
output any space characters between adjacent table cells, so the text of neighbouring
cells runs together.

Example:

Input
------------------------------
<table>
  <tr>
    <td>Apache LUCENE</td><td>is f****** amazing!</td>
  </tr>
  <tr>
    <td>Apache TIKA</td><td>freaks you out!</td>
  </tr>
</table>
------------------------------

Output
------------------------------

Apache LUCENEis f****** amazing!

Apache TIKAfreaks you out!

------------------------------

Unfortunately, I didn't have the time to investigate this within HTMLParser.
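
To make the expected behaviour concrete, here is a minimal sketch that uses only the JDK's SAX classes, not Tika itself. Tika parsers report content as SAX events, so the idea is that a text-collecting handler appends whitespace whenever a cell or row ends. The names CellSpacingSketch, TextCollector and extract are hypothetical, and this is not a patch against the real HTMLParser; it assumes well-formed XHTML input.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: plain JDK SAX, no Tika classes involved.
final class CellSpacingSketch {

    /** Collects character data; optionally separates table cells and rows. */
    static final class TextCollector extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();
        private final boolean separateCells;

        TextCollector(boolean separateCells) {
            this.separateCells = separateCells;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (!separateCells) {
                return; // reproduces the reported behaviour: cells run together
            }
            if (qName.equals("td") || qName.equals("th")) {
                out.append(' ');  // separate adjacent cells
            } else if (qName.equals("tr")) {
                out.append('\n'); // separate rows
            }
        }

        String text() {
            return out.toString();
        }
    }

    /** Parses well-formed XHTML and returns the collected text. */
    static String extract(String xhtml, boolean separateCells) throws Exception {
        TextCollector handler = new TextCollector(separateCells);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text();
    }
}
```

Calling extract with separateCells set to false on a row such as the second one in the example yields the run-together text shown under Output; with true, the two cells are separated by a space.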

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

