tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1193) Allow access to HtmlParser's HtmlSchema
Date Mon, 18 Nov 2013 22:11:22 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825832#comment-13825832 ]

Jukka Zitting commented on TIKA-1193:

A cleaner approach would probably be to allow the caller to pass a custom schema through the
ParseContext object:

ParseContext context = new ParseContext();
context.set(Schema.class, ...);
parser.parse(..., context);

The {{HtmlParser}} class could then get the custom schema from the context:

Schema schema = context.get(Schema.class, HTML_SCHEMA);
parser.setProperty(org.ccil.cowan.tagsoup.Parser.schemaProperty, schema);
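
The lookup-with-default pattern above can be tried without Tika or TagSoup on the classpath. The sketch below is only an illustration: SketchContext, SketchSchema and SketchHtmlParser are hypothetical stand-ins for ParseContext, TagSoup's Schema and HtmlParser, not the real classes, and resolveSchema is an assumed helper name. It shows the intended behaviour: the parser falls back to its built-in schema unless the caller has put a custom one into the context.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Tika's ParseContext: values are stored per class.
final class SketchContext {
    private final Map<String, Object> entries = new HashMap<>();

    // Registers (or, when value is null, clears) the object for the given type.
    <T> void set(Class<T> key, T value) {
        if (value != null) {
            entries.put(key.getName(), value);
        } else {
            entries.remove(key.getName());
        }
    }

    // Returns the registered object, or defaultValue when none was set.
    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key.getName());
        return value != null ? key.cast(value) : defaultValue;
    }
}

// Hypothetical stand-in for org.ccil.cowan.tagsoup.Schema.
class SketchSchema {
    final String name;

    SketchSchema(String name) {
        this.name = name;
    }
}

// Hypothetical stand-in for HtmlParser, reduced to the schema lookup.
final class SketchHtmlParser {
    static final SketchSchema HTML_SCHEMA = new SketchSchema("default");

    // Assumed helper: picks the caller's schema if present, else the default.
    SketchSchema resolveSchema(SketchContext context) {
        return context.get(SketchSchema.class, HTML_SCHEMA);
    }
}
```

With this shape, callers that never touch the context keep the current behaviour, so the change would be backwards compatible.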

> Allow access to HtmlParser's HtmlSchema
> ---------------------------------------
>                 Key: TIKA-1193
>                 URL: https://issues.apache.org/jira/browse/TIKA-1193
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>             Fix For: 1.5
>         Attachments: TIKA-1193-trunk.patch
> TagSoup's HTMLSchema is not really well suited for HTML5, nor is it capable of correctly handling some very strange quirks, e.g. a table inside an anchor. By allowing access to the schema, applications can modify the schema to suit their needs on the fly.
> This would also mean that we don't have to rely on TIKA-985 getting committed; we can change it from our own applications.

This message was sent by Atlassian JIRA
