tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1017) DefaultHtmlMapper misses some safe elements
Date Wed, 07 Nov 2012 13:37:13 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492342#comment-13492342 ]

Jukka Zitting commented on TIKA-1017:

The idea behind DefaultHtmlMapper is to normalize and simplify the incoming HTML as
much as possible while still preserving the semantic structure of the document. We can add
extra elements if there's a good use case that's not already covered by the IdentityHtmlMapper.
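
To make the trade-off concrete, here is a minimal, self-contained sketch of the two
approaches. It does not use Tika's classes: ElementMapper, NormalizingMapper,
IdentityMapper and ExtendedMapper are hypothetical stand-ins, and the element list is
illustrative rather than the actual safe list. Only the method name mapSafeElement is
borrowed from the mapper discussed above; check the HtmlMapper javadoc for the real
signatures.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class MapperSketch {

    /** Hypothetical stand-in: returns the normalized element name, or null to drop the tag. */
    public interface ElementMapper {
        String mapSafeElement(String name);
    }

    /** Normalizing approach: only a small whitelist of structural elements passes through. */
    public static class NormalizingMapper implements ElementMapper {
        private final Set<String> safe = new HashSet<>();

        public NormalizingMapper() {
            // Illustrative subset only, not the real safe list.
            for (String n : new String[] {"p", "h1", "ul", "li", "table", "tr", "td"}) {
                safe.add(n);
            }
        }

        /** Lets subclasses extend the whitelist. */
        protected void allow(String name) {
            safe.add(name.toLowerCase(Locale.ENGLISH));
        }

        @Override
        public String mapSafeElement(String name) {
            String lower = name.toLowerCase(Locale.ENGLISH);
            return safe.contains(lower) ? lower : null;
        }
    }

    /** Identity approach: every element is kept, only the case is normalized. */
    public static class IdentityMapper implements ElementMapper {
        @Override
        public String mapSafeElement(String name) {
            return name.toLowerCase(Locale.ENGLISH);
        }
    }

    /** A caller who needs <sub> and <i> can extend the normalizing mapper instead. */
    public static class ExtendedMapper extends NormalizingMapper {
        public ExtendedMapper() {
            allow("sub");
            allow("i");
        }
    }

    public static void main(String[] args) {
        ElementMapper[] mappers = {
            new NormalizingMapper(), new IdentityMapper(), new ExtendedMapper()
        };
        for (ElementMapper m : mappers) {
            System.out.println(m.getClass().getSimpleName()
                + ": P -> " + m.mapSafeElement("P")
                + ", sub -> " + m.mapSafeElement("sub"));
        }
    }
}
```

The sketch shows why a missing element is not necessarily a bug: a caller who wants
everything preserved can switch to the identity-style mapper, and a caller who needs
only a few extra elements can subclass the normalizing one.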
> DefaultHtmlMapper misses some safe elements
> -------------------------------------------
>                 Key: TIKA-1017
>                 URL: https://issues.apache.org/jira/browse/TIKA-1017
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Daniel Bonniot de Ruisselet
> The code of DefaultHtmlMapper says that the list of "safe" elements is based on http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
> Elements like <sub> and <i> are not included in the safe list. Is this intentional
> (a comment with the rationale would be useful) or should they be added?

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
