nutch-dev mailing list archives

From "Jeff Bowden (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-98) RobotRulesParser interprets robots.txt incorrectly
Date Thu, 29 Sep 2005 06:28:47 GMT
RobotRulesParser interprets robots.txt incorrectly
--------------------------------------------------

         Key: NUTCH-98
         URL: http://issues.apache.org/jira/browse/NUTCH-98
     Project: Nutch
        Type: Bug
  Components: fetcher  
    Versions: 0.7    
    Reporter: Jeff Bowden
    Priority: Minor


Here's a simple example that the current RobotRulesParser gets wrong:

User-agent: *
Disallow: /
Allow: /rss


The problem is that the isAllowed function takes the first rule that matches and
incorrectly decides that URLs starting with "/rss" are disallowed.  The correct
algorithm is to take the rule with the *longest* matching prefix.  I will attach
a patch that fixes this.
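
For illustration only (the attached patch is the actual fix), here is a minimal
Java sketch of longest-match selection, using a hypothetical Rule class that is
not part of the Nutch source:

import java.util.ArrayList;
import java.util.List;

public class RobotRulesSketch {

  // Hypothetical holder for one Allow/Disallow line: a path prefix plus a flag.
  static class Rule {
    final String prefix;
    final boolean allowed;
    Rule(String prefix, boolean allowed) {
      this.prefix = prefix;
      this.allowed = allowed;
    }
  }

  // Pick the matching rule with the longest prefix instead of the first match.
  static boolean isAllowed(List<Rule> rules, String path) {
    int bestLength = -1;
    boolean bestAllowed = true;            // no matching rule => allowed
    for (Rule rule : rules) {
      if (path.startsWith(rule.prefix) && rule.prefix.length() > bestLength) {
        bestLength = rule.prefix.length();
        bestAllowed = rule.allowed;
      }
    }
    return bestAllowed;
  }

  public static void main(String[] args) {
    List<Rule> rules = new ArrayList<Rule>();
    rules.add(new Rule("/", false));       // Disallow: /
    rules.add(new Rule("/rss", true));     // Allow: /rss
    System.out.println(isAllowed(rules, "/rss/latest.xml"));  // true
    System.out.println(isAllowed(rules, "/private/page"));    // false
  }
}

With first-match semantics both URLs would be rejected, because "Disallow: /"
matches everything; longest-match correctly lets "/rss/latest.xml" through.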

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

