tika-dev mailing list archives

From "Joachim Zittmayr (JIRA)" <j...@apache.org>
Subject [jira] Updated: (TIKA-268) HTMLParser omits necessary space-characters when parsing table-data
Date Tue, 18 Aug 2009 09:05:14 GMT

     [ https://issues.apache.org/jira/browse/TIKA-268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joachim Zittmayr updated TIKA-268:
----------------------------------

    Affects Version/s:     (was: 0.5)
                       0.3

> HTMLParser omits necessary space-characters when parsing table-data
> -------------------------------------------------------------------
>
>                 Key: TIKA-268
>                 URL: https://issues.apache.org/jira/browse/TIKA-268
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.3, 0.4
>         Environment: Win, Mac, Lin; Java 5+
>            Reporter: Joachim Zittmayr
>            Priority: Critical
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When an HTML file containing a table is passed to Tika, the HTML parser does not output
> space characters between table cells, so the text of adjacent cells runs together.
> Example:
> Input
> ------------------------------
> <table>
>   <tr>
>     <td>Apache LUCENE</td><td>is f****** amazing!</td>
>   </tr>
>   <tr>
>     <td>Apache TIKA</td><td>freaks you out!</td>
>   </tr>
> </table>
> ------------------------------
> Output
> ------------------------------
> Apache LUCENEis f****** amazing!
> Apache TIKAfreaks you out!
> ------------------------------
> Unfortunately, I didn't have time to investigate this within HTMLParser.
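
For illustration only, the behaviour the report asks for can be sketched with the SAX classes that ship with the JDK. This is not Tika's HTMLParser and not a proposed patch: the names TableTextSketch, CellSeparatingHandler and extract are hypothetical, the input is assumed to be well-formed XHTML, and the choice of a single space after each cell and a newline after each row is an assumption about what the desired output would look like.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TableTextSketch {

    /** Hypothetical handler: collects text and separates table cells and rows. */
    static class CellSeparatingHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equalsIgnoreCase("td") || qName.equalsIgnoreCase("th")) {
                // Assumed behaviour: one space after every cell.
                out.append(' ');
            } else if (qName.equalsIgnoreCase("tr")) {
                // Drop the space left by the last cell, then end the row.
                int n = out.length();
                if (n > 0 && out.charAt(n - 1) == ' ') {
                    out.setLength(n - 1);
                }
                out.append('\n');
            }
        }

        String text() {
            return out.toString();
        }
    }

    /** Extracts plain text from well-formed XHTML using the JDK SAX parser. */
    public static String extract(String xhtml) {
        try {
            CellSeparatingHandler handler = new CellSeparatingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String html = "<table>"
                + "<tr><td>Apache LUCENE</td><td>is f****** amazing!</td></tr>"
                + "<tr><td>Apache TIKA</td><td>freaks you out!</td></tr>"
                + "</table>";
        System.out.print(extract(html));
    }
}
```

With this sketch, the two cells of each row come out separated by a space instead of being concatenated as in the reported output.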

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.