tika-dev mailing list archives

From "Gerard Bouchar (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2652) HtmlParser generates incorrect meta tags
Date Fri, 25 May 2018 10:10:00 GMT
Gerard Bouchar created TIKA-2652:

             Summary: HtmlParser generates incorrect meta tags
                 Key: TIKA-2652
                 URL: https://issues.apache.org/jira/browse/TIKA-2652
             Project: Tika
          Issue Type: Bug
            Reporter: Gerard Bouchar

Whatever attributes the meta tags in the input HTML have, the meta tags emitted by Tika's
HtmlParser only ever have a "name" and a "content" attribute. This produces invalid HTML
meta tags in the output.

For instance, the following valid HTML file

<!DOCTYPE html>
<html lang="en">
    <meta http-equiv="refresh" content="0; url=http://example.com">

is transformed into a SAX stream corresponding to the following HTML:

<html xmlns="http://www.w3.org/1999/xhtml">
<meta name="dc:title" content="Title"/>
<meta name="Content-Encoding" content="ISO-8859-1"/>
<meta name="refresh" content="0; url=http://example.com"/>
<meta name="Content-Type" content="text/html; charset=ISO-8859-1"/>

The "http-equiv" attribute of the original meta tag is lost: the tag is emitted as a
generic "meta name=" tag instead.

This is a problem when working with classes that expect a valid meta redirect, such as
Nutch's.
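
To illustrate why the lost attribute matters, here is a minimal sketch of a downstream consumer that looks for a meta redirect in a SAX stream. It uses only the JDK's SAX parser, not Tika or Nutch; the class and method names (MetaRefreshCheck, RefreshHandler, findRefresh) are hypothetical, and the two input strings are reduced, well-formed versions of the expected and the reported output.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class MetaRefreshCheck {

    /** Hypothetical consumer: remembers the content of a meta http-equiv="refresh" element. */
    static class RefreshHandler extends DefaultHandler {
        String refresh; // stays null when no http-equiv="refresh" meta is seen

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Only a meta element carrying http-equiv="refresh" counts as a redirect.
            if ("meta".equals(qName) && "refresh".equalsIgnoreCase(atts.getValue("http-equiv"))) {
                refresh = atts.getValue("content");
            }
        }
    }

    /** Parses the given XHTML string and returns the redirect content, or null if none is found. */
    static String findRefresh(String xhtml) throws Exception {
        RefreshHandler handler = new RefreshHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.refresh;
    }

    public static void main(String[] args) throws Exception {
        // What a consumer would expect to receive (assumption: http-equiv preserved).
        String expected =
                "<html><meta http-equiv=\"refresh\" content=\"0; url=http://example.com\"/></html>";
        // What is reported in this issue: the attribute has become "name".
        String reported =
                "<html><meta name=\"refresh\" content=\"0; url=http://example.com\"/></html>";

        System.out.println(findRefresh(expected)); // prints "0; url=http://example.com"
        System.out.println(findRefresh(reported)); // prints "null"
    }
}
```

With the attribute preserved the redirect is found; with the "name" form a consumer of this kind sees no redirect at all.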

