nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://
Date Mon, 07 Jan 2019 11:14:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735682#comment-16735682 ]

ASF GitHub Bot commented on NUTCH-2683:
---------------------------------------

sebastian-nagel commented on pull request #425: NUTCH-2683 DeduplicationJob: add option to
prefer https:// over http://
URL: https://github.com/apache/nutch/pull/425
 
 
   - add optional value "httpsOverHttp" to the -compareOrder argument: https:// is preferred over
http:// if "httpsOverHttp" comes before "urlLength" and neither "score" nor "fetchTime" takes precedence
   - code improvements: remove nested loop, sort imports, add `@Override` annotations where
applicable
   
   Testing with one pair of https/http duplicates:
   ```
   % cat seeds.txt 
   http://nutch.apache.org/
   https://nutch.apache.org/
   
   % nutch inject crawldb seeds.txt
   ...
   
   % nutch generate crawldb/ segments
   ...
   
   % nutch fetch segments/*
   ...
   
   % nutch parse segments/*
   ...
   
   % nutch updatedb crawldb/ segments/*
   ...
   
   % nutch dedup crawldb -compareOrder httpsOverHttp,score,urlLength,fetchTime
   ...
   Deduplication: 1 documents marked as duplicates
   ...
   
   % nutch readdb crawldb/ -url https://nutch.apache.org/
   URL: https://nutch.apache.org/
   Version: 7
   Status: 2 (db_fetched)
   Fetch time: Wed Feb 06 11:55:33 CET 2019
   Modified time: Mon Jan 07 11:55:33 CET 2019
   Retries since fetch: 0
   Retry interval: 2592000 seconds (30 days)
   Score: 1.1800001
   Signature: da0ffbf19768ea2cab9ffa0fb4a778a7
   Metadata: 
   ...
   
   % nutch readdb crawldb/ -url http://nutch.apache.org/
   URL: http://nutch.apache.org/
   Version: 7
   Status: 7 (db_duplicate)
   Fetch time: Wed Feb 06 11:55:39 CET 2019
   Modified time: Mon Jan 07 11:55:39 CET 2019
   Retries since fetch: 0
   Retry interval: 2592000 seconds (30 days)
   Score: 1.1800001
   Signature: da0ffbf19768ea2cab9ffa0fb4a778a7
   Metadata: 
   ...
   ```
   The URL `https://nutch.apache.org/` is kept as expected if "httpsOverHttp" is configured.
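
For readers who want to see why the position of "httpsOverHttp" in -compareOrder matters, here is a minimal sketch of the precedence logic. It is not the DeduplicationJob code: `Entry`, `CRITERIA` and `pick_best` are hypothetical names, and the sketch assumes that a higher score, a more recent fetch time and a shorter URL are preferred, and that "httpsOverHttp" only decides when two URLs are identical except for the protocol. The sample values are loosely modeled on the test above.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for a CrawlDb record (not a Nutch class)."""
    url: str
    score: float
    fetch_time: int  # illustrative timestamp, larger means more recent


def _https_over_http(a, b):
    """Return the https entry if a and b differ only by protocol, else None."""
    for keep, drop in ((a, b), (b, a)):
        if (keep.url.startswith("https://") and drop.url.startswith("http://")
                and keep.url[len("https://"):] == drop.url[len("http://"):]):
            return keep
    return None


def _by_key(key, prefer_larger):
    """Build a criterion that prefers the larger (or smaller) key value."""
    def pick(a, b):
        ka, kb = key(a), key(b)
        if ka == kb:
            return None  # tie: let the next criterion decide
        return a if (ka > kb) == prefer_larger else b
    return pick


# Assumed preferences; the names mirror the -compareOrder values.
CRITERIA = {
    "httpsOverHttp": _https_over_http,
    "score": _by_key(lambda e: e.score, prefer_larger=True),
    "fetchTime": _by_key(lambda e: e.fetch_time, prefer_larger=True),
    "urlLength": _by_key(lambda e: len(e.url), prefer_larger=False),
}


def pick_best(a, b, compare_order):
    """Apply the criteria in the given order; the first non-tie wins."""
    for name in compare_order.split(","):
        winner = CRITERIA[name](a, b)
        if winner is not None:
            return winner
    return a  # all criteria tied: keep the first entry


https = Entry("https://nutch.apache.org/", 1.18, 33)
http = Entry("http://nutch.apache.org/", 1.18, 39)

# "httpsOverHttp" comes first, so the https URL is kept.
print(pick_best(http, https, "httpsOverHttp,score,urlLength,fetchTime").url)
# Without it, the scores tie and the shorter http URL is kept.
print(pick_best(http, https, "score,urlLength,fetchTime").url)
```

In this sketch, placing "httpsOverHttp" after "urlLength" would have no effect on such a pair, because the http:// URL is always one character shorter and "urlLength" would already have decided.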
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> DeduplicationJob: add option to prefer https:// over http://
> ------------------------------------------------------------
>
>                 Key: NUTCH-2683
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2683
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.16
>
>
> The deduplication job allows keeping the shortest URL as the "best" URL of a set of
duplicates, marking all longer ones as duplicates. Recently, search engines have started to penalize
non-https pages by [giving https pages a higher rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html]
and [marking http as insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].
> If URLs are identical except for the protocol, the deduplication job should be able to
prefer https:// over http:// URLs, although the latter are shorter by one character.
Of course, this should be configurable and work in addition to the existing preferences (length, score
and fetch time) used to select the "best" URL among duplicates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
