nutch-dev mailing list archives

From "Felix Zett (JIRA)" <>
Subject [jira] [Created] (NUTCH-2720) ROBOTS metatag ignored when capitalized
Date Thu, 23 May 2019 15:10:00 GMT
Felix Zett created NUTCH-2720:

             Summary: ROBOTS metatag ignored when capitalized
                 Key: NUTCH-2720
             Project: Nutch
          Issue Type: Bug
          Components: indexer, robots
    Affects Versions: 1.15
            Reporter: Felix Zett
         Attachments: noindex.html

As discussed on the mailing list, index-metadata fails to ignore a webpage with a capitalized
robots metatag such as {{<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">}}. This only applies
when parse-tika is used; parse-html will "decapitalize" the metatag.

Parsing the attached [^noindex.html] leads to the following results:

{code}
bin/nutch parsechecker -Dplugin.includes="protocol-httpclient|parse-(html|metatags)|index-metadata" -Dindexer.delete.robots.noindex="true" -Dmetatags.names="robots""metatag.robots"

Parse Metadata: [...] metatag.robots=noindex,nofollow robots=noindex,nofollow
{code}

{code}
bin/nutch parsechecker -Dplugin.includes="protocol-httpclient|parse-(tika|metatags)|index-metadata" -Dindexer.delete.robots.noindex="true" -Dmetatags.names="robots""metatag.robots"

Parse Metadata: metatag.robots=NOINDEX,NOFOLLOW [...] ROBOTS=NOINDEX,NOFOLLOW [...]
{code}

The field being named "ROBOTS" and not "robots" leads to {{parseData.getMeta("robots")}} being {{null}}
in [].
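
The failure mode can be illustrated with a minimal, self-contained sketch. It does not use any Nutch classes: a plain {{java.util.Map}} stands in for the parse metadata, and the class and method names ({{RobotsMetaLookup}}, {{exactLookup}}, {{caseInsensitiveLookup}}, {{isNoIndex}}) are hypothetical. It only shows why an exact-match lookup of "robots" misses a field stored as "ROBOTS", and what a case-insensitive lookup would return instead.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustration only: a plain Map stands in for the parse metadata.
public class RobotsMetaLookup {

    // Exact-match lookup: finds the value only if the key's case matches.
    static String exactLookup(Map<String, String> meta, String name) {
        return meta.get(name);
    }

    // Case-insensitive lookup: compares each key with equalsIgnoreCase.
    static String caseInsensitiveLookup(Map<String, String> meta, String name) {
        for (Map.Entry<String, String> entry : meta.entrySet()) {
            if (entry.getKey().equalsIgnoreCase(name)) {
                return entry.getValue();
            }
        }
        return null;
    }

    // True if the robots content contains a "noindex" directive, in any case.
    static boolean isNoIndex(String robotsContent) {
        return robotsContent != null
            && robotsContent.toLowerCase(Locale.ROOT).contains("noindex");
    }

    public static void main(String[] args) {
        // Field name and value as reported for the parse-tika run.
        Map<String, String> parseMeta = new LinkedHashMap<>();
        parseMeta.put("ROBOTS", "NOINDEX,NOFOLLOW");

        String exact = exactLookup(parseMeta, "robots");
        String relaxed = caseInsensitiveLookup(parseMeta, "robots");

        System.out.println("exact lookup:            " + exact);
        System.out.println("case-insensitive lookup: " + relaxed);
        System.out.println("noindex detected:        " + isNoIndex(relaxed));
    }
}
```

With the capitalized field name, the exact lookup yields {{null}}, so a noindex check built on it never fires; the case-insensitive variant finds the value. Whether the fix belongs in the lookup or in normalizing the field name at parse time is left open here.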

This message was sent by Atlassian JIRA