nutch-dev mailing list archives

From "Jeff Bowden (JIRA)" <>
Subject [jira] Commented: (NUTCH-98) RobotRulesParser interprets robots.txt incorrectly
Date Thu, 29 Sep 2005 20:19:47 GMT

Jeff Bowden commented on NUTCH-98:

OK, so actually I'm wrong on two counts.

1. The current accepted standard does not have Allow lines.

2. The draft standard does, but it specifies that the robot should take the first
match found (Nutch's current implementation).

Any rule whose path is prefix-matched by an earlier rule is rendered completely ineffective
according to the standard.  My patch was motivated by what I thought was the obvious interpretation,
given examples I've seen in the field.  The initial example I gave is from

> RobotRulesParser interprets robots.txt incorrectly
> --------------------------------------------------
>          Key: NUTCH-98
>          URL:
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7
>     Reporter: Jeff Bowden
>     Priority: Minor
>  Attachments:
> Here's a simple example that the current RobotRulesParser gets wrong:
> User-agent: *
> Disallow: /
> Allow: /rss
> The problem is that the isAllowed function takes the first rule that matches and incorrectly
> decides that URLs starting with "/rss" are disallowed.  The correct algorithm is to take the
> *longest* rule that matches.  I will attach a patch that fixes this.
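
For illustration, here is a minimal sketch (not Nutch's actual RobotRulesParser code; the rule list and function names are made up for this example) contrasting the two matching strategies on the robots.txt fragment above:

```python
# Rules are (allowed, path_prefix) pairs in file order, modeling:
#   Disallow: /
#   Allow: /rss
rules = [(False, "/"), (True, "/rss")]

def first_match_allowed(path):
    # First-match semantics (draft standard / Nutch's current behavior):
    # the first rule whose prefix matches the path decides.
    for allowed, prefix in rules:
        if path.startswith(prefix):
            return allowed
    return True  # no rule matched: allowed by default

def longest_match_allowed(path):
    # Longest-match semantics (the behavior the patch proposes):
    # the most specific (longest) matching prefix decides.
    best = None
    for allowed, prefix in rules:
        if path.startswith(prefix):
            if best is None or len(prefix) > len(best[1]):
                best = (allowed, prefix)
    return best[0] if best else True

print(first_match_allowed("/rss/feed.xml"))    # False: "Disallow: /" shadows everything
print(longest_match_allowed("/rss/feed.xml"))  # True: "Allow: /rss" is the longer match
```

Under first-match, the later `Allow: /rss` line can never fire because `Disallow: /` prefix-matches every path first; under longest-match it takes precedence for paths under /rss.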

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators:
For more information on JIRA, see:
