tika-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1193) Allow access to HtmlParser's HtmlSchema
Date Mon, 18 Nov 2013 21:59:21 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825823#comment-13825823 ]

Markus Jelsma commented on TIKA-1193:
-------------------------------------

Hi, are there any objections to putting this in? I know unit tests can break if applications
modify the schema incorrectly, e.g. by removing the shape attribute from anchors, but that is the
responsibility of the application. Perhaps marking it as expert would be sufficient?
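
To make the concern concrete, here is a minimal, self-contained sketch of what "modifying the schema" means for parser output. It is a toy model only: `SchemaSketch`, `Schema`, `setDefault`, `removeDefault`, `attributesFor` and `defaultSchema` are hypothetical names invented for this illustration, not TagSoup's or Tika's API, and the `shape="rect"` default is used purely as the example mentioned above.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy stand-in for a parser schema: element name -> default attributes.
// All names here are hypothetical; this is not TagSoup's or Tika's API.
final class SchemaSketch {

    /** Maps an element name to the default attributes a parser would add to it. */
    static final class Schema {
        private final Map<String, Map<String, String>> defaults = new LinkedHashMap<>();

        void setDefault(String element, String attribute, String value) {
            defaults.computeIfAbsent(element, k -> new LinkedHashMap<>()).put(attribute, value);
        }

        void removeDefault(String element, String attribute) {
            Map<String, String> attrs = defaults.get(element);
            if (attrs != null) {
                attrs.remove(attribute);
            }
        }

        /** Returns the element's defaults with the explicit attributes merged on top. */
        Map<String, String> attributesFor(String element, Map<String, String> explicit) {
            Map<String, String> merged = new LinkedHashMap<>(
                    defaults.getOrDefault(element, Collections.<String, String>emptyMap()));
            merged.putAll(explicit);
            return merged;
        }
    }

    /** Builds the schema a parser would use when the application supplies none. */
    static Schema defaultSchema() {
        Schema schema = new Schema();
        schema.setDefault("a", "shape", "rect"); // illustrative default on anchors
        return schema;
    }

    public static void main(String[] args) {
        Map<String, String> explicit = Collections.singletonMap("href", "http://example.org/");

        // Stock schema: the anchor carries the default attribute.
        Schema stock = defaultSchema();
        System.out.println(stock.attributesFor("a", explicit));

        // Application-modified schema: the default is gone, so any test
        // that asserts on it would now fail.
        Schema custom = defaultSchema();
        custom.removeDefault("a", "shape");
        System.out.println(custom.attributesFor("a", explicit));
    }
}
```

In this sketch the application edits its own schema instance, so the stock one is unaffected; a test written against the stock output only breaks when it is run with the modified schema, which is the application's call to make.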

> Allow access to HtmlParser's HtmlSchema
> ---------------------------------------
>
>                 Key: TIKA-1193
>                 URL: https://issues.apache.org/jira/browse/TIKA-1193
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>             Fix For: 1.5
>
>         Attachments: TIKA-1193-trunk.patch
>
>
> TagSoup's HTMLSchema is not well suited to HTML5, nor can it correctly handle some very
> strange quirks, e.g. tables inside anchors. By allowing access to the schema, applications
> can modify it on the fly to suit their needs.
> This would also mean that we don't have to rely on TIKA-985 getting committed; we can
> change the schema from our own applications.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
