nutch-dev mailing list archives

From "Jeff Bowden (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-98) RobotRulesParser interprets robots.txt incorrectly
Date Thu, 29 Sep 2005 20:19:47 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-98?page=comments#action_12330867 ] 

Jeff Bowden commented on NUTCH-98:
----------------------------------

OK, so actually I'm wrong on two counts.

1. The current accepted standard does not have Allow lines.

2. The draft standard does (http://www.robotstxt.org/wc/norobots-rfc.html), but it specifies
that the robot should take the first match found (Nutch's current implementation).

According to the draft standard, any rule whose path is already prefix-matched by an earlier
rule never takes effect.  My patch was motivated by what I thought was the obvious interpretation,
given examples I've seen in the field.  The initial example I gave is from http://del.icio.us/robots.txt
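
To make the difference between the two interpretations concrete, here is a minimal sketch in Python. It is not Nutch's Java RobotRulesParser code; the function names and the (prefix, allowed) rule representation are assumptions made purely for illustration, and matching is reduced to plain string prefixes.

```python
# Hypothetical illustration only: rules are (path_prefix, allowed) pairs
# in the order they appear in robots.txt. Not taken from Nutch.
RULES = [
    ("/", False),    # Disallow: /
    ("/rss", True),  # Allow: /rss
]


def first_match_allowed(path, rules):
    """First-match reading: the earliest rule whose prefix matches decides."""
    for prefix, allowed in rules:
        if path.startswith(prefix):
            return allowed
    return True  # no rule matched, so the path is allowed by default


def longest_match_allowed(path, rules):
    """Longest-match reading: the longest matching prefix decides."""
    best_len, best_allowed = -1, True  # default to allowed if nothing matches
    for prefix, allowed in rules:
        if path.startswith(prefix) and len(prefix) > best_len:
            best_len, best_allowed = len(prefix), allowed
    return best_allowed


if __name__ == "__main__":
    for path in ("/rss/feed", "/private"):
        print(path,
              first_match_allowed(path, RULES),
              longest_match_allowed(path, RULES))
```

Under the first-match reading, "Disallow: /" matches every path, so the later "Allow: /rss" line is never consulted. Under the longest-match reading, the same "/rss" rule takes precedence for any path beginning with "/rss".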





> RobotRulesParser interprets robots.txt incorrectly
> --------------------------------------------------
>
>          Key: NUTCH-98
>          URL: http://issues.apache.org/jira/browse/NUTCH-98
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7
>     Reporter: Jeff Bowden
>     Priority: Minor
>  Attachments: RobotRulesParser.java.diff
>
> Here's a simple example that the current RobotRulesParser gets wrong:
> User-agent: *
> Disallow: /
> Allow: /rss
> The problem is that the isAllowed function takes the first rule that matches and incorrectly
> decides that URLs starting with "/rss" are Disallowed.  The correct algorithm is to take the
> *longest* rule that matches.  I will attach a patch that fixes this.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

