nutch-dev mailing list archives

From "Gerard Bouchar (JIRA)" <>
Subject [jira] [Created] (NUTCH-2634) Some links marked as "nofollow" are followed anyway.
Date Tue, 14 Aug 2018 07:48:00 GMT
Gerard Bouchar created NUTCH-2634:

             Summary: Some links marked as "nofollow" are followed anyway.
                 Key: NUTCH-2634
             Project: Nutch
          Issue Type: Bug
            Reporter: Gerard Bouchar

To decide whether an outlink in an <a> tag can be followed, Nutch checks whether
the value of its rel attribute is exactly the string "nofollow".
However, the rel attribute can contain a space-separated list of link types, all of which should be respected.

So Nutch correctly does not follow a link like:
<a href='top-secret.html' rel="nofollow">DO NOT FOLLOW THIS LINK</a>

but wrongly follows:
<a href='top-secret.html' rel="nofollow noreferrer">DO NOT FOLLOW THIS LINK</a>
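
For illustration only, here is a minimal sketch of a token-based check that would treat both links above the same way. The class and method names (RelTokens, hasNoFollow) are hypothetical and are not taken from Nutch's parsers; the sketch assumes the rel value is available as a plain string and that link types should be compared case-insensitively.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not Nutch's actual code.
final class RelTokens {
  private RelTokens() {}

  /** Returns true if the rel attribute value contains the "nofollow" link type. */
  static boolean hasNoFollow(String rel) {
    if (rel == null) {
      return false;
    }
    // Treat rel as a whitespace-separated list of tokens rather than one string.
    for (String token : rel.trim().split("\\s+")) {
      // Compare each token case-insensitively against "nofollow".
      if ("nofollow".equals(token.toLowerCase(Locale.ROOT))) {
        return true;
      }
    }
    return false;
  }
}
```

With this approach, rel="nofollow" and rel="nofollow noreferrer" both count as nofollow, while a value such as rel="noreferrer" does not.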

Because the code is duplicated across Nutch's HTML parsers, this should be fixed in two places:
# [parse/html/|]
# [parse/tika/|]

This message was sent by Atlassian JIRA
