tika-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1193) Allow access to HtmlParser's HtmlSchema
Date Fri, 29 Nov 2013 16:56:35 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated TIKA-1193:
--------------------------------

    Attachment: TIKA-1193-trunk.patch

Yes, I agree. Here's a new patch plus a unit test that uses a customized schema to obtain anchor
text from tables inside an anchor.

> Allow access to HtmlParser's HtmlSchema
> ---------------------------------------
>
>                 Key: TIKA-1193
>                 URL: https://issues.apache.org/jira/browse/TIKA-1193
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>             Fix For: 1.5
>
>         Attachments: TIKA-1193-trunk.patch, TIKA-1193-trunk.patch
>
>
> TagSoup's HTMLSchema is not well suited to HTML5, nor can it correctly handle some very strange
> quirks, e.g. tables inside anchors. By allowing access to the schema, applications can modify
> it on the fly to suit their needs.
> This would also mean that we don't have to rely on TIKA-985 getting committed; we can change
> the schema from our own applications.
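
To illustrate why a schema-level change can help with the table-inside-anchor case, here is a
minimal, self-contained sketch. It does not use the real TagSoup or Tika classes: ToySchema, its
M_* constants, and strictDefaults() are all hypothetical names invented for this example. It only
assumes the general idea of a schema that records, per element, a bitmask of what the element may
contain and a bitmask of the groups it belongs to.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a bitmask-based HTML schema. NOT the real TagSoup or Tika API;
// every name here is hypothetical and exists only for illustration.
class ToySchema {
    // Content-model bits (values chosen arbitrarily for this sketch).
    static final int M_INLINE = 1;
    static final int M_BLOCK  = 1 << 1;
    static final int M_TABLE  = 1 << 2;

    // model: groups an element may contain; memberOf: groups it belongs to.
    private static final class ElementType {
        final int model;
        final int memberOf;

        ElementType(int model, int memberOf) {
            this.model = model;
            this.memberOf = memberOf;
        }
    }

    private final Map<String, ElementType> types = new HashMap<>();

    // Define or redefine an element type; redefinition is the "on the fly" change.
    void elementType(String name, int model, int memberOf) {
        types.put(name, new ElementType(model, memberOf));
    }

    // A parent may contain a child when the parent's model overlaps the child's groups.
    boolean canContain(String parent, String child) {
        ElementType p = types.get(parent);
        ElementType c = types.get(child);
        if (p == null || c == null) {
            throw new IllegalArgumentException("unknown element: " + (p == null ? parent : child));
        }
        return (p.model & c.memberOf) != 0;
    }

    // A strict starting point: anchors hold inline content only, tables are block-level.
    static ToySchema strictDefaults() {
        ToySchema s = new ToySchema();
        s.elementType("body", M_BLOCK | M_INLINE, 0);
        s.elementType("a", M_INLINE, M_INLINE);
        s.elementType("table", M_TABLE, M_BLOCK);
        return s;
    }

    public static void main(String[] args) {
        ToySchema schema = strictDefaults();
        System.out.println(schema.canContain("a", "table")); // false

        // The application relaxes the anchor's content model to admit block content.
        schema.elementType("a", M_INLINE | M_BLOCK, M_INLINE);
        System.out.println(schema.canContain("a", "table")); // true
    }
}
```

In this toy model the application changes a single element definition rather than the parser
itself, which is the kind of flexibility the issue asks for. How the actual patch exposes the
schema is defined by the attached TIKA-1193-trunk.patch, not by this sketch.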



--
This message was sent by Atlassian JIRA
(v6.1#6144)
