tika-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2100) Html Parser does not keep the html tag attributes
Date Fri, 25 May 2018 12:28:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490642#comment-16490642 ]

ASF GitHub Bot commented on TIKA-2100:
--------------------------------------

GerardBouchar opened a new pull request #238: TIKA-2100 extract content language from html lang attribute
URL: https://github.com/apache/tika/pull/238
 
 
   The [recommended way](https://www.w3.org/International/questions/qa-html-language-declarations)
of declaring the language of an HTML document is to specify it in the `lang` attribute of
the `<html>` tag.
   
   Tika currently not only ignores this attribute, but also removes it from its SAX output,
making it unavailable to client applications.
   
   This PR adds the value of the `lang` attribute to the document's metadata (under the existing
key `Metadata.CONTENT_LANGUAGE`) and also makes it available in the SAX output.
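
   To illustrate what having the attribute available in the SAX output means for a client, here is a minimal sketch of a handler that records the `lang` value of the `<html>` element. It is an assumption-laden illustration, not code from this PR: it uses the JDK's built-in XML SAX parser rather than Tika, feeds it a well-formed version of the sample page without the DOCTYPE, and the names `LangAttributeDemo`, `LangHandler` and `extractLang` are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; it does not use or depend on Tika.
class LangAttributeDemo {

    /** Records the lang attribute seen on the first <html> element. */
    static class LangHandler extends DefaultHandler {
        String lang; // stays null if <html> carries no lang attribute

        @Override
        public void startElement(String ns, String localName, String name, Attributes atts) {
            // The JDK parser is not namespace-aware by default,
            // so the element name arrives in `name` (the qualified name).
            if ("html".equalsIgnoreCase(name) && lang == null) {
                lang = atts.getValue("lang");
            }
        }
    }

    /** Parses a well-formed document and returns the lang value, or null if absent. */
    static String extractLang(String xhtml) {
        try {
            LangHandler handler = new LangHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.lang;
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch easy to call.
            throw new IllegalStateException("could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<html lang=\"en\"><head><title>Page Title</title></head>"
                + "<body><h1 align=\"left\">My First Heading</h1>"
                + "<p>My first paragraph.</p></body></html>";
        System.out.println(extractLang(sample)); // prints "en"
    }
}
```

   A client handler written this way only works if the parser actually forwards the attributes of the `<html>` element to `startElement`.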

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Html Parser does not keep the html tag attributes
> -------------------------------------------------
>
>                 Key: TIKA-2100
>                 URL: https://issues.apache.org/jira/browse/TIKA-2100
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> When parsing a very simple HTML document such as
>  <!DOCTYPE html>
> <html lang="en">
> <head>
> <title>Page Title</title>
> </head>
> <body>
> <h1 align="left">My First Heading</h1>
> <p>My first paragraph.</p>
> </body>
> </html> 
> you won't be able to access the html tag's attributes (here lang="en") in the ContentHandler:
> * In the method startElement(String ns, String localName, String name, Attributes atts), atts is empty.
> * Moreover, it seems that the html tag's attributes are not passed through HtmlMapper.mapSafeAttribute either.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
