nutch-dev mailing list archives

From "Gerard Bouchar (JIRA)" <>
Subject [jira] [Created] (NUTCH-2589) HTML redirections are not followed when using parse-tika
Date Tue, 29 May 2018 16:01:00 GMT
Gerard Bouchar created NUTCH-2589:

             Summary: HTML redirections are not followed when using parse-tika
                 Key: NUTCH-2589
             Project: Nutch
          Issue Type: Bug
         Environment: nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->


fetched url:
            Reporter: Gerard Bouchar

HTML redirections using meta tags are supported in Nutch. They work well when parse-html
is used to parse files. However, when parse-tika is used, they are not detected.

This is because of

Tika emits redirection meta tags as:

<meta name="refresh" content="0; url="/>

whereas org.apache.nutch.parse.tika.HTMLMetaProcessor expects meta tags in the following
format:

<meta http-equiv="refresh" content="0; url=">
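
To illustrate the mismatch, here is a minimal sketch of a check that accepts both
forms of the refresh meta tag. It is not the actual HTMLMetaProcessor code: the class
MetaRefreshSketch, its methods, and the use of a plain attribute map in place of a DOM
node are all hypothetical, and the target path /target.html is made up for the example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: not part of Nutch or Tika.
public class MetaRefreshSketch {

    /** Returns true if the meta attributes describe a refresh in either form. */
    public static boolean isRefresh(Map<String, String> attrs) {
        // Form the report says the meta processor expects: http-equiv="refresh"
        String httpEquiv = attrs.get("http-equiv");
        // Form the report says Tika emits: name="refresh"
        String name = attrs.get("name");
        return "refresh".equalsIgnoreCase(httpEquiv)
            || "refresh".equalsIgnoreCase(name);
    }

    /** Extracts the redirect target from content="<delay>; url=<target>". */
    public static Optional<String> refreshTarget(Map<String, String> attrs) {
        if (!isRefresh(attrs)) {
            return Optional.empty();
        }
        String content = attrs.get("content");
        if (content == null) {
            return Optional.empty();
        }
        // Case-insensitive search for the "url=" marker.
        int idx = content.toLowerCase(Locale.ROOT).indexOf("url=");
        if (idx < 0) {
            return Optional.empty();
        }
        String target = content.substring(idx + "url=".length()).trim();
        return target.isEmpty() ? Optional.empty() : Optional.of(target);
    }

    public static void main(String[] args) {
        // Assumed example inputs; the URL is invented for illustration.
        Map<String, String> tikaStyle =
            Map.of("name", "refresh", "content", "0; url=/target.html");
        Map<String, String> htmlStyle =
            Map.of("http-equiv", "refresh", "content", "0; url=/target.html");

        System.out.println(refreshTarget(tikaStyle).isPresent());
        System.out.println(refreshTarget(htmlStyle).isPresent());
    }
}
```

Whether the better fix is to accept name="refresh" on the Nutch side, as sketched here,
or to change what Tika emits is left open by this report.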

This message was sent by Atlassian JIRA
