nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects
Date Wed, 31 Jan 2018 22:58:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347735#comment-16347735 ]

Markus Jelsma commented on NUTCH-2466:
--------------------------------------

Hello Moreno,

Well, we obviously could allow a -1 setting and treat that as forever, but forever is infinite,
and an endless redirect loop would hang the Nutch task until Hadoop treats it as timed out,
usually within ten minutes.

The setting is an int, so if you want, you can set it to the maximum positive integer and
handle just over two billion consecutive redirects.

I believe that would justify the meaning of forever in this context, do you agree?

As a side note, having dealt with the crudeness of the www for many years, I consider any
sequence of more than four redirects a symptom of a whole other problem. Our (company, not
ASF Nutch) maximum setting is always three; anything higher has, so far, always led to circular
redirects.
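
To illustrate the point about a bounded limit, here is a minimal sketch of a redirect follower that gives up after a configured number of hops or as soon as a URL repeats. It is not the code from the attached patch: the class RedirectLimitSketch, the resolve method and the in-memory redirects map (standing in for real HTTP fetches) are all hypothetical names made up for this example.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical illustration only; not taken from the NUTCH-2466 patch.
public class RedirectLimitSketch {

  /**
   * Follows redirects starting at url. Gives up (returns empty) once
   * maxRedirects hops have been taken, or as soon as a URL is seen twice,
   * which indicates a circular redirect. The redirects map stands in for
   * real HTTP responses: each key redirects to its value.
   */
  static Optional<String> resolve(String url, Map<String, String> redirects, int maxRedirects) {
    Set<String> seen = new HashSet<>();
    String current = url;
    int hops = 0;
    while (redirects.containsKey(current)) {
      if (hops >= maxRedirects || !seen.add(current)) {
        return Optional.empty();
      }
      current = redirects.get(current);
      hops++;
    }
    return Optional.of(current);
  }

  public static void main(String[] args) {
    // Assumed example chain: http > https > sitemap_index.xml
    Map<String, String> redirects = Map.of(
        "http://example.org/sitemap.xml", "https://example.org/sitemap.xml",
        "https://example.org/sitemap.xml", "https://example.org/sitemap_index.xml");

    // A limit of three is enough for the two hops above.
    System.out.println(resolve("http://example.org/sitemap.xml", redirects, 3));
    // A limit of one stops before the second redirect.
    System.out.println(resolve("http://example.org/sitemap.xml", redirects, 1));
    // The largest value an int setting can hold: 2147483647.
    System.out.println(Integer.MAX_VALUE);
  }
}
```

With the repeat check in place, even a limit of Integer.MAX_VALUE ends after one pass around a loop in this sketch; without it, only the hop limit stops a circular chain, which is why a small limit is the safer default.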


> Sitemap processor to follow redirects
> -------------------------------------
>
>                 Key: NUTCH-2466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2466
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.15
>
>         Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch
>
>
> It does follow http > https, but not the following redirect, e.g. sitemap_index.xml
> that some websites have.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
