nutch-dev mailing list archives

From "Gerard Bouchar (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-2634) Some links marked as "nofollow" are followed anyway.
Date Tue, 14 Aug 2018 07:50:00 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gerard Bouchar updated NUTCH-2634:
----------------------------------
    Description: 
To check whether an outlink in an <a> tag can be followed, Nutch checks whether
the value of its rel attribute is the exact string "nofollow".
However, [the rel attribute can contain a space-separated list of link types|https://html.spec.whatwg.org/multipage/links.html#attr-hyperlink-rel],
all of which should be respected.

So Nutch correctly does not follow a link like:
{code:html}
<a href='top-secret.html' rel="nofollow">DO NOT FOLLOW THIS LINK</a>
{code}

but wrongly follows:
{code:html}
<a href='top-secret.html' rel="nofollow noreferrer">DO NOT FOLLOW THIS LINK</a>
{code}

Because of the code duplication in Nutch's HTML parsers, this should be fixed in two places:
# [parse/html/DOMContentUtils.java|https://github.com/apache/nutch/blob/3ada351a26b653b307c19e25b17e0e611a9bd59a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L437]
# [parse/tika/DOMContentUtils.java|https://github.com/apache/nutch/blob/f02110f42c53e77450835776cf41f22c23f030ec/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java#L410]
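
As a rough illustration of the intended check (not the actual Nutch patch), the rel value could be split into whitespace-separated tokens and each token compared case-insensitively against "nofollow". The class name {{RelAttribute}} and the method name {{hasNofollow}} below are hypothetical and do not exist in Nutch; this is a minimal sketch assuming the rel value is available as a plain string.

```java
/** Hypothetical helper, not part of Nutch: token-based check of a rel value. */
final class RelAttribute {

  private RelAttribute() {}

  /** Returns true if any whitespace-separated token of rel is "nofollow". */
  static boolean hasNofollow(String rel) {
    if (rel == null) {
      return false; // no rel attribute at all
    }
    // The HTML spec defines rel as a set of tokens separated by ASCII whitespace.
    for (String token : rel.trim().split("[ \\t\\n\\f\\r]+")) {
      // Link type keywords are matched case-insensitively.
      if ("nofollow".equalsIgnoreCase(token)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(hasNofollow("nofollow"));            // true
    System.out.println(hasNofollow("nofollow noreferrer")); // true
    System.out.println(hasNofollow("noreferrer"));          // false
  }
}
```

With a check along these lines, both example links above would be left unfollowed, while a link carrying only rel="noreferrer" would still be followed.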

  was:
In order to check if an outlink in an <a> tag can be followed, nutch checks whether
the value of its rel attribute is the exact string "nofollow".
However, the rel attribute can contain a list of link types, all of which should be respected.

So nutch rightfully doesn't follow a link like:
{code:html}
<a href='top-secret.html' rel="nofollow">DO NOT FOLLOW THIS LINK</a>
{code}

but wrongfully follows :
{code:html}
<a href='top-secret.html' rel="nofollow noreferrer">DO NOT FOLLOW THIS LINK</a>
{code}

Because of the code duplication in nutch's html parsers, this should be fixed in two places:
# [parse/html/DOMContentUtils.java|https://github.com/apache/nutch/blob/3ada351a26b653b307c19e25b17e0e611a9bd59a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L437]
# [parse/tika/DOMContentUtils.java|https://github.com/apache/nutch/blob/f02110f42c53e77450835776cf41f22c23f030ec/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java#L410]


> Some links marked as "nofollow" are followed anyway.
> ----------------------------------------------------
>
>                 Key: NUTCH-2634
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2634
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
