tika-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2033) Value attributes of input elements not extracted from HTML
Date Thu, 14 Jul 2016 20:54:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378358#comment-15378358 ]

Ken Krugler commented on TIKA-2033:

Do you have a suggestion for how the text should appear in the resulting document? E.g. as-is,
or with "input " preceding it, or something else?

> Value attributes of input elements not extracted from HTML 
> -----------------------------------------------------------
>                 Key: TIKA-2033
>                 URL: https://issues.apache.org/jira/browse/TIKA-2033
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.10
>         Environment: Windows 7, java8 x64
>            Reporter: Luis Filipe Nassif
>            Priority: Minor
> The text of value attributes of input elements currently is not extracted from HTML files.
> Note it is rendered by browsers. I tried using IdentityHtmlMapper and played with HtmlSchema
> with no luck. Simple test HTML below:
> <HTML><body><input value='text'></input></body></HTML>
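
To make the two output options from the comment concrete ("as-is" versus prefixed with "input "), here is a minimal sketch. It is not Tika code and not a proposed patch: it assumes the plain JDK SAX parser as a stand-in for Tika's HTML parser, which only works because the test HTML above happens to be well-formed XML. The class name InputValueSketch, the handler InputValueHandler, and the prefix parameter are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class InputValueSketch {

    /**
     * Hypothetical SAX handler: collects character data and, in addition,
     * the value attribute of every input element it sees.
     */
    static class InputValueHandler extends DefaultHandler {
        private final String prefix; // "" for as-is, "input " for the prefixed form
        private final StringBuilder text = new StringBuilder();

        InputValueHandler(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("input".equalsIgnoreCase(qName)) {
                String value = atts.getValue("value");
                if (value != null && !value.isEmpty()) {
                    // Put each extracted value on its own line.
                    if (text.length() > 0) {
                        text.append('\n');
                    }
                    text.append(prefix).append(value);
                }
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        String getText() {
            return text.toString();
        }
    }

    /** Parses well-formed markup and returns the collected text. */
    public static String extract(String html, String prefix) {
        InputValueHandler handler = new InputValueHandler(prefix);
        try {
            // JDK SAX parser used as a stand-in; real-world HTML is usually not well-formed XML.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(html)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
        return handler.getText();
    }

    public static void main(String[] args) {
        String html = "<HTML><body><input value='text'></input></body></HTML>";
        System.out.println(extract(html, ""));       // option 1: as-is
        System.out.println(extract(html, "input ")); // option 2: with "input " preceding it
    }
}
```

For the test HTML from the report, the first call yields the bare attribute value and the second yields the same value with "input " in front of it. Whichever form is preferred, the same idea would have to be applied inside Tika's own HTML handling rather than in a separate parser as sketched here.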

This message was sent by Atlassian JIRA
