tika-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-985) Support for HTML5 elements
Date Thu, 25 Jul 2013 15:13:49 GMT

     [ https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated TIKA-985:
-------------------------------

    Attachment: TIKA-985-1.5.patch

Quick-and-dirty patch for Tika 1.5. It allows headings (h1...h6) to be nested inside other
elements, such as anchors. HTML5 permits this and some pages already do it. Without this patch,
the SAX events for such headings are reported out of order.
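
To illustrate what "in order" means here, the following is a minimal sketch that records SAX
element events for a heading nested inside an anchor. It assumes well-formed XHTML input and uses
the JDK's built-in SAX parser, not Tika or TagSoup, so it only shows the event order a
ContentHandler would expect to see, not the behavior of the patch itself. The class and method
names (HeadingEventOrder, events) are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HeadingEventOrder {

    // Parses the given well-formed markup and returns the start/end element
    // events in the order the parser reports them.
    // Assumption: the JDK's default (non-namespace-aware) SAX parser stands in
    // for the Tika/TagSoup pipeline discussed in this issue.
    public static List<String> events(String xhtml) throws Exception {
        List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                seen.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                seen.add("end:" + qName);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return seen;
    }

    public static void main(String[] args) throws Exception {
        // A heading embedded in an anchor, as HTML5 allows.
        String xhtml = "<div><a href=\"#top\"><h1>Title</h1></a></div>";
        System.out.println(events(xhtml));
    }
}
```

For this input the heading's events fall between the anchor's start and end events:
start:div, start:a, start:h1, end:h1, end:a, end:div.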
                
> Support for HTML5 elements
> --------------------------
>
>                 Key: TIKA-985
>                 URL: https://issues.apache.org/jira/browse/TIKA-985
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.2
>            Reporter: Markus Jelsma
>             Fix For: 1.5
>
>         Attachments: TIKA-985-1.3-1.patch, TIKA-985-1.3-2.patch, TIKA-985-1.3-3.patch, TIKA-985-1.5.patch
>
>
> TagSoup's schema.tssl does not include some HTML5 elements (e.g. article, section). This
> prevents some custom ContentHandlers from reading expected elements and/or attributes.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
