nutch-dev mailing list archives

From "Jeff Bowden (JIRA)" <>
Subject [jira] Created: (NUTCH-98) RobotRulesParser interprets robots.txt incorrectly
Date Thu, 29 Sep 2005 06:28:47 GMT
RobotRulesParser interprets robots.txt incorrectly

         Key: NUTCH-98
     Project: Nutch
        Type: Bug
  Components: fetcher  
    Versions: 0.7    
    Reporter: Jeff Bowden
    Priority: Minor

Here's a simple example that the current RobotRulesParser gets wrong:

User-agent: *
Disallow: /
Allow: /rss

The problem is that the isAllowed function applies the first rule that matches. Here
"Disallow: /" matches every path, so URLs starting with "/rss" are incorrectly treated as
disallowed. The correct algorithm is to take the *longest* rule that matches, which for
"/rss" is "Allow: /rss". I will attach a patch that fixes this.
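
For illustration only, here is a minimal sketch of longest-match selection. It is not the attached patch and not Nutch's actual RobotRulesParser code: the class name LongestMatchRobotRules, the Rule type, and the isAllowed signature are hypothetical, and the tie-breaking choice (the earlier rule wins when two prefixes have equal length) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and structure are assumptions, not Nutch's API.
public class LongestMatchRobotRules {

    /** One Allow/Disallow line: a path prefix and whether it permits access. */
    static final class Rule {
        final String prefix;
        final boolean allow;

        Rule(String prefix, boolean allow) {
            this.prefix = prefix;
            this.allow = allow;
        }
    }

    /** Returns the verdict of the longest rule whose prefix matches the path. */
    static boolean isAllowed(List<Rule> rules, String path) {
        Rule best = null;
        for (Rule rule : rules) {
            // An empty prefix (e.g. "Disallow:") is treated as matching nothing.
            if (rule.prefix.isEmpty() || !path.startsWith(rule.prefix)) {
                continue;
            }
            // Strictly longer wins; on equal length the earlier rule is kept (assumption).
            if (best == null || rule.prefix.length() > best.prefix.length()) {
                best = rule;
            }
        }
        // No matching rule means the path is allowed.
        return best == null || best.allow;
    }

    public static void main(String[] args) {
        // The rules from the example above: Disallow: / then Allow: /rss
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("/", false));
        rules.add(new Rule("/rss", true));

        System.out.println(isAllowed(rules, "/rss/feed.xml")); // true
        System.out.println(isAllowed(rules, "/index.html"));   // false
    }
}
```

With first-match selection, "/rss/feed.xml" would hit "Disallow: /" and be rejected. With longest-match, "/rss" (4 characters) beats "/" (1 character), so the Allow line decides.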
