nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://
Date Wed, 10 Apr 2019 11:36:00 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2683.
------------------------------------
    Resolution: Implemented

> DeduplicationJob: add option to prefer https:// over http://
> ------------------------------------------------------------
>
>                 Key: NUTCH-2683
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2683
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.16
>
>
> The deduplication job can keep the shortest URL as the "best" URL of a set of
duplicates, marking all longer ones as duplicates. Search engines have recently started to penalize
non-https pages by [giving https pages a higher rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html]
and [marking http as insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].
> If two URLs are identical except for the protocol, the deduplication job should be able to
prefer the https:// URL over the http:// one, although the latter is shorter by one character.
Of course, this should be configurable and should apply in addition to the existing preferences (length, score
and fetch time) used to select the "best" URL among duplicates.
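
The requested preference can be illustrated with a minimal sketch. This is not the code committed for NUTCH-2683; the class name `HttpsPreferenceSketch`, the method names, and the boolean `preferHttps` switch are hypothetical and only show the rule described above: when two duplicates differ solely by protocol and the option is on, keep the https:// URL, otherwise fall back to the shorter URL.

```java
// Hypothetical illustration only, not the Nutch DeduplicationJob implementation.
public class HttpsPreferenceSketch {

    // Returns true if the two URLs are identical except that the first
    // uses http:// and the second uses https://.
    public static boolean differOnlyByScheme(String httpUrl, String httpsUrl) {
        return httpUrl.startsWith("http://")
            && httpsUrl.startsWith("https://")
            && httpUrl.substring("http://".length())
                   .equals(httpsUrl.substring("https://".length()));
    }

    // Picks the URL to keep from a pair of duplicates.
    // preferHttps is an assumed configuration switch.
    public static String keep(String a, String b, boolean preferHttps) {
        if (preferHttps) {
            if (differOnlyByScheme(a, b)) return b; // b is the https variant
            if (differOnlyByScheme(b, a)) return a; // a is the https variant
        }
        // Fallback: the shorter URL wins; on equal length keep the first.
        return b.length() < a.length() ? b : a;
    }

    public static void main(String[] args) {
        String http = "http://example.com/page";
        String https = "https://example.com/page";
        System.out.println(keep(http, https, false)); // prints http://example.com/page
        System.out.println(keep(http, https, true));  // prints https://example.com/page
    }
}
```

With the switch off, the http:// URL wins because it is one character shorter; with it on, the https:// URL is kept. The sketch leaves out score and fetch time, which the issue lists as the other existing criteria.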



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
