tika-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-268) HTMLParser ommits necessary space-characters when parsing table-data
Date Tue, 18 Aug 2009 09:38:14 GMT

    [ https://issues.apache.org/jira/browse/TIKA-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744436#action_12744436 ]

Uwe Schindler commented on TIKA-268:
------------------------------------

The problem is that the HTML parser strips all tags that are not in SAFE_ELEMENTS. <TABLE>
tags are replaced by <P>, and all inner tags are simply ignored and not passed through. Since
all other ContentHandlers (like OOXML, OpenXML, ...) produce XHTML table tags, the HTML parser
should preserve the table as well. This can be achieved by modifying the SAFE_ELEMENTS map.
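
To illustrate the idea, here is a minimal sketch of a tag whitelist that either collapses tables to <P> or passes the table tags through. It is not the actual HTMLParser code: the class name, the method names, and the contents of the map are assumptions made up for this example, and only the name SAFE_ELEMENTS comes from the comment above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the parser's SAFE_ELEMENTS lookup.
// Keys are HTML tag names, values are the XHTML elements emitted for them.
class SafeElementsSketch {

    static Map<String, String> safeElements(boolean preserveTables) {
        Map<String, String> safe = new HashMap<>();
        safe.put("P", "p");
        safe.put("H1", "h1");
        if (preserveTables) {
            // Proposed behaviour: table structure is passed through unchanged.
            safe.put("TABLE", "table");
            safe.put("TR", "tr");
            safe.put("TH", "th");
            safe.put("TD", "td");
        } else {
            // Behaviour described in the comment: the table collapses to a
            // paragraph and the inner tags are not listed at all.
            safe.put("TABLE", "p");
        }
        return safe;
    }

    // Returns the element to emit, or null if the tag is dropped.
    static String mapTag(Map<String, String> safe, String htmlTag) {
        return safe.get(htmlTag.toUpperCase(Locale.ROOT));
    }
}
```

With preserveTables set to false, mapTag returns "p" for "table" and null for "td", so the cell boundaries are lost. With it set to true, "td" maps to "td" and downstream handlers can see where one cell ends and the next begins.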

If you then convert the output to plain text, it will contain tabs and newlines, because XHTMLContentHandler
emits ignorableWhitespace between table tags and newlines after HTML block tags.
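
The effect on text-only output can be sketched with plain SAX classes from the JDK. This does not use Tika itself: the classes TextCollector and TableToTextSketch are hypothetical, and the exact placement of the tab and newline events is an assumption about what a table-aware producer might emit.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Text-only consumer: keeps character data and ignorable whitespace,
// and discards all element events.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }
}

// Simulates the SAX events for one table row from the issue's example.
class TableToTextSketch {

    private static void cell(DefaultHandler h, String s) throws SAXException {
        h.startElement("", "td", "td", new AttributesImpl());
        h.characters(s.toCharArray(), 0, s.length());
        h.endElement("", "td", "td");
    }

    private static void whitespace(DefaultHandler h, String s) throws SAXException {
        h.ignorableWhitespace(s.toCharArray(), 0, s.length());
    }

    static String render() {
        TextCollector out = new TextCollector();
        try {
            out.startElement("", "tr", "tr", new AttributesImpl());
            cell(out, "Apache TIKA");
            whitespace(out, "\t");   // assumed: tab between cells
            cell(out, "freaks you out!");
            out.endElement("", "tr", "tr");
            whitespace(out, "\n");   // assumed: newline after the row
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.text();
    }
}
```

Calling TableToTextSketch.render() yields the two cell texts separated by a tab and terminated by a newline, instead of the run-together text shown in the issue's Output section.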

> HTMLParser ommits necessary space-characters when parsing table-data 
> ---------------------------------------------------------------------
>
>                 Key: TIKA-268
>                 URL: https://issues.apache.org/jira/browse/TIKA-268
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.3, 0.4
>         Environment: Win, Mac, Lin; Java 5+
>            Reporter: Joachim Zittmayr
>            Priority: Critical
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When an HTML file with a table structure is given to the TIKA-ecosystem, the HTML parser
> doesn't output space characters between table cells.
> Example:
> Input
> ------------------------------
> <table>
>   <tr>
>     <td>Apache LUCENE</td><td>is f****** amazing!</td>
>   </tr>
>   <tr>
>     <td>Apache TIKA</td><td>freaks you out!</td>
>   </tr>
> </table>
> ------------------------------
> Output
> ------------------------------
> Apache LUCENEis f****** amazing!
> Apache TIKAfreaks you out!
> ------------------------------
> Unfortunately, I didn't have the time to investigate this within HTMLParser.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

