nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-710) Support for rel="canonical" attribute
Date Wed, 09 Apr 2014 14:30:24 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964202#comment-13964202 ]

Sebastian Nagel commented on NUTCH-710:
---------------------------------------

Thanks, [~Sertac Turkel]! My comments:
* every page containing a canonical link is now rejected. That's a rather harsh decision. It
should be configurable whether pages containing correct (non-empty, not self-referential,
etc.) canonical links
*# are unconditionally rejected
*# are removed later only if the target is indexed. This is close to deduplication, and it's
what canonical links are intended for: to give webmasters a chance to support and influence
deduplication.
*# are only recorded (as outlinks and/or as indexed fields)
This point is the most challenging one: you need to take care of all the nasty situations "in
the wild", e.g. a canonical link pointing to a redirect which leads you back to the current
page, etc. Chains of canonical links have to be "resolved" in combination with redirects,
see Julien's comment and [1|http://mail-archives.apache.org/mod_mbox/nutch-user/201203.mbox/%3CCA+-fM0sg=rvuNxzoez5NLFmhNJHta=qP5qHTfRJ8ii55fB2mJA@mail.gmail.com%3E].
* is it really necessary to handle canonical links explicitly in DbUpdateMapper and mark
them as injected? Couldn't this be done by simply adding them as outlinks? By default, links of
"link" elements are added as outlinks, cf. parser.html.outlinks.ignore_tags. Of course, canonical
links should be added even if "link" elements are ignored.
* extraction of canonical links: at least the following points are missing: relative URLs,
and canonical links inside HTTP headers (required for anything which is not HTML). I'll try
to support you on this point because there's already some work done.
* keep the test class names parallel?
{code}src/plugin/parse-html/.../TestDOMContentUtils.java
src/plugin/parse-tika/.../DOMContentUtilsTest.java
{code}
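
To illustrate the chain resolution mentioned above, here is a minimal sketch, not Nutch code. It assumes the canonical links and redirects seen so far are available as two plain maps from URL to target; the class name, the method, and the {{maxHops}} parameter are hypothetical. It follows redirects and canonical links until it reaches a URL with no further hop, and gives up (returns null) when the chain loops back or gets too long, so the caller can keep the original page.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of Nutch.
class CanonicalChainSketch {

  /**
   * Follows redirects and canonical links starting at url.
   * Returns the final target, or null if the chain contains a loop
   * or needs more than maxHops hops.
   */
  static String resolve(String url, Map<String, String> canonical,
      Map<String, String> redirect, int maxHops) {
    Set<String> seen = new HashSet<>();
    String current = url;
    for (int hops = 0; hops <= maxHops; hops++) {
      if (!seen.add(current)) {
        return null; // already visited: the chain leads back to itself
      }
      // Assumption: a redirect takes precedence over a canonical link.
      String next = redirect.get(current);
      if (next == null) {
        next = canonical.get(current);
      }
      if (next == null || next.equals(current)) {
        return current; // no further hop, or a self-referential canonical link
      }
      current = next;
    }
    return null; // chain too long
  }
}
```

A null result covers the case described above, where a canonical link points to a redirect that leads back to the current page.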

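For the two missing extraction cases (relative URLs and canonical links in HTTP headers), a rough sketch using only the JDK could look like the following. The class and method names are hypothetical and this is not the existing work mentioned above. The header parsing is deliberately simplified: it assumes link-values of the form {{<url>; rel="canonical"}} separated by commas, and it would be confused by commas inside quoted parameters.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Nutch.
class CanonicalExtractSketch {

  // Simplified pattern for one link-value: <target>; ... rel="value"
  private static final Pattern LINK_VALUE = Pattern.compile(
      "<([^>]*)>\\s*;[^,]*\\brel\\s*=\\s*\"?([^\";,]*)\"?",
      Pattern.CASE_INSENSITIVE);

  /** Resolves a possibly relative href against the page URL; null if invalid. */
  static String resolve(String baseUrl, String href) {
    if (href == null || href.trim().isEmpty()) {
      return null; // empty canonical links are ignored
    }
    try {
      return new URI(baseUrl).resolve(new URI(href.trim())).toString();
    } catch (URISyntaxException e) {
      return null; // malformed base or target
    }
  }

  /** Returns the canonical URL from a "Link" header value, or null if none. */
  static String fromLinkHeader(String baseUrl, String headerValue) {
    if (headerValue == null) {
      return null;
    }
    Matcher m = LINK_VALUE.matcher(headerValue);
    while (m.find()) {
      // rel may hold several space-separated relation types
      for (String rel : m.group(2).trim().split("\\s+")) {
        if (rel.equalsIgnoreCase("canonical")) {
          return resolve(baseUrl, m.group(1));
        }
      }
    }
    return null;
  }
}
```

The same {{resolve}} method would also serve for the href of a "link" element found in the HTML head, so both sources end up with absolute URLs.
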
... and some useful references:
[http://en.wikipedia.org/wiki/Canonical_link_element]
[http://tools.ietf.org/html/rfc6596]
[https://support.google.com/webmasters/answer/139066]
[http://www.mattcutts.com/blog/rel-canonical-html-head/]
[http://googlewebmastercentral.blogspot.de/2011/06/supporting-relcanonical-http-headers.html]


> Support for rel="canonical" attribute
> -------------------------------------
>
>                 Key: NUTCH-710
>                 URL: https://issues.apache.org/jira/browse/NUTCH-710
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.1
>            Reporter: Frank McCown
>            Priority: Minor
>             Fix For: 1.9
>
>         Attachments: NUTCH-710.patch, canonical.patch
>
>
> There is a new rel="canonical" attribute which is
> now being supported by Google, Yahoo, and Live:
> http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
> Adding support for this attribute value will potentially reduce the number of URLs crawled
> and indexed and reduce duplicate page content.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
