nutch-dev mailing list archives

From "ASF GitHub Bot (Jira)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2776) Fetcher to temporarily deduplicate followed redirects
Date Fri, 20 Mar 2020 19:16:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063585#comment-17063585 ]

ASF GitHub Bot commented on NUTCH-2776:
---------------------------------------

sebastian-nagel commented on pull request #505: NUTCH-2776 Fetcher to temporarily deduplicate followed redirects
URL: https://github.com/apache/nutch/pull/505
 
 
   - cache followed redirect targets for a configurable time (`fetcher.redirect.dedupcache.seconds`)
   - if a redirect target is found in the cache, it is skipped
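
The two points above can be illustrated with a minimal Java sketch of a time-limited cache of redirect targets. This is not the code from the pull request: the class `RedirectDedupCache` and its methods `isDuplicate` and `evictExpired` are hypothetical names chosen for illustration, and the constructor argument only stands in for the value of `fetcher.redirect.dedupcache.seconds`. The clock is injectable so the expiry behavior can be checked without waiting.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical time-limited cache of followed redirect targets (illustration only). */
class RedirectDedupCache {
  // Maps a redirect target URL to the time (in milliseconds) at which its entry expires.
  private final Map<String, Long> expiryByUrl = new ConcurrentHashMap<>();
  private final long ttlMillis;
  private final LongSupplier clockMillis;

  /** ttlSeconds plays the role of fetcher.redirect.dedupcache.seconds (assumption). */
  RedirectDedupCache(long ttlSeconds, LongSupplier clockMillis) {
    this.ttlMillis = ttlSeconds * 1000L;
    this.clockMillis = clockMillis;
  }

  RedirectDedupCache(long ttlSeconds) {
    this(ttlSeconds, System::currentTimeMillis);
  }

  /**
   * Returns true if the target was already followed within the time window and
   * should be skipped. Otherwise records the target and returns false.
   */
  boolean isDuplicate(String redirectTarget) {
    long now = clockMillis.getAsLong();
    boolean[] duplicate = {false};
    // compute() runs atomically per key, so concurrent fetcher threads
    // cannot both see the same target as "new".
    expiryByUrl.compute(redirectTarget, (url, expiry) -> {
      if (expiry != null && expiry > now) {
        duplicate[0] = true;
        return expiry; // keep the original expiry time
      }
      return now + ttlMillis; // first sighting, or the old entry has expired
    });
    return duplicate[0];
  }

  /** Removes expired entries so the map does not grow without bound. */
  void evictExpired() {
    long now = clockMillis.getAsLong();
    expiryByUrl.values().removeIf(expiry -> expiry <= now);
  }
}
```

In this sketch a fetcher thread would call `isDuplicate(target)` before following a redirect and skip the target when the call returns true. Keeping the original expiry time on a cache hit means a target is followed again once the window has passed, even if redirects to it keep arriving.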
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Fetcher to temporarily deduplicate followed redirects
> -----------------------------------------------------
>
>                 Key: NUTCH-2776
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2776
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>
>
> If the fetcher follows redirects (http.redirect.max > 0), it may happen that many redirects
> of a site point to the same URL. In this situation, it might be good if the fetcher could
> temporarily (for a configurable time period) deduplicate the redirect targets and skip all
> redirects except the first one. Typical examples of duplicated redirect targets are:
> - a redirect to a generic page instead of responding with HTTP status 404:
> {noformat}
> /
> /resource-not-found
> /search/
> /404
> /error/not-found
> /err/notfound.html
> {noformat}
> - a page to accept/decline cookies
> {noformat}
> /cookie_usage.php
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
