nutch-dev mailing list archives

From Julien Nioche <lists.digitalpeb...@gmail.com>
Subject Re: robots.txt redirect (NUTCH-124)
Date Fri, 03 Apr 2009 17:56:02 GMT
Hi Mathijs,

I've posted a patch for this on
https://issues.apache.org/jira/browse/NUTCH-731

HTH

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com


2009/3/17 Mathijs Homminga <mathijs.homminga@gmail.com>

> Hi everybody,
>
> Can someone shed some light on NUTCH-124?
> RobotRulesParser.java doesn't follow redirects when requesting the
> robots.txt file. Doug patched this, but the patch didn't make it into trunk.
> What is the desired behavior here?
>
>
> For example, when requesting the following url:
> http://7is7.com/software/stateye/download/stateye097f.html
>
> ... RobotRulesParser requests the following robots.txt:
> http://7is7.com/robots.txt
>
> ... however, that file doesn't exist; the request redirects to:
> http://www.7is7.com/robots.txt
>
> ... that robots.txt tells us the initial url is disallowed.
> But does it really? Or is that robots.txt file only applicable to
> http://www.7is7.com and not to http://7is7.com?
>
> So the question is: should we follow such redirects?
>
> Thanks,
> Mathijs
>
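
To make the two possible behaviors concrete, here is a minimal sketch in Python (not Nutch code, and not the NUTCH-731 patch). It follows redirects for a robots.txt request up to a hop limit, with a switch for whether a redirect to a different host is accepted. The names resolve_robots_url, fake_fetch, allow_cross_host and MAX_REDIRECTS are all hypothetical, the hop limit of 5 is an assumption, and the robots.txt body in the fake response table is invented for illustration.

```python
from urllib.parse import urljoin, urlsplit

MAX_REDIRECTS = 5  # assumed hop limit, not a Nutch setting

REDIRECT_CODES = (301, 302, 303, 307, 308)


def resolve_robots_url(start_url, fetch, allow_cross_host=True,
                       max_redirects=MAX_REDIRECTS):
    """Follow redirects for a robots.txt URL and return (final_url, body).

    `fetch` is a hypothetical callable: url -> (status, location, body).
    Returns (None, None) when no robots.txt is reached under the policy.
    """
    origin_host = urlsplit(start_url).hostname
    url = start_url
    # One initial request plus at most `max_redirects` follow-up requests.
    for _ in range(max_redirects + 1):
        status, location, body = fetch(url)
        if status == 200:
            return url, body
        if status in REDIRECT_CODES and location:
            target = urljoin(url, location)  # handles relative Location values
            if not allow_cross_host and urlsplit(target).hostname != origin_host:
                # Strict policy: a robots.txt on another host is not used.
                return None, None
            url = target
            continue
        return None, None  # 404, server error, or redirect without Location
    return None, None  # too many redirects


# Canned responses standing in for the network. The rule text is made up.
FAKE_RESPONSES = {
    "http://7is7.com/robots.txt":
        (301, "http://www.7is7.com/robots.txt", ""),
    "http://www.7is7.com/robots.txt":
        (200, None, "User-agent: *\nDisallow: /software/\n"),
}


def fake_fetch(url):
    """Hypothetical fetcher backed by FAKE_RESPONSES; unknown URLs give 404."""
    return FAKE_RESPONSES.get(url, (404, None, ""))


followed = resolve_robots_url("http://7is7.com/robots.txt", fake_fetch)
strict = resolve_robots_url("http://7is7.com/robots.txt", fake_fetch,
                            allow_cross_host=False)
```

With allow_cross_host=True the rules found on www.7is7.com would be applied to 7is7.com; with False the crawler behaves as if no robots.txt had been found. Which of the two is right is exactly the open question above.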
