nutch-dev mailing list archives

From "Rod Taylor (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-98) RobotRulesParser interprets robots.txt incorrectly
Date Sat, 03 Dec 2005 19:34:30 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-98?page=comments#action_12359237 ] 

Rod Taylor commented on NUTCH-98:
---------------------------------

According to the Googlebot FAQ, their implementation obeys the longest matching rule
(the most specific path prefix), not the first one that matches.

See point 7 of http://www.google.com/webmasters/bot.html.

Also, there's a small difference between the way Googlebot handles the robots.txt file and
the way the robots.txt standard says we should (keeping in mind the distinction between "should"
and "must"). The standard says we should obey the first applicable rule, whereas Googlebot
obeys the longest (that is, the most specific) applicable rule. This more intuitive practice
matches what people actually do, and what they expect us to do. For example, consider the
following robots.txt file:

User-Agent: *
Allow: /
Disallow: /cgi-bin 
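
To make the difference concrete, here is a minimal Java sketch of the two policies. It is not the actual Nutch RobotRulesParser code or the attached patch: the class name RobotRulesSketch, the Rule type, and both method names are made up for illustration. It assumes rules are plain path prefixes, that a tie in length goes to the earlier rule, and that a path matching no rule is allowed. It does not handle an empty "Disallow:" line or user-agent selection.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not the Nutch RobotRulesParser. */
final class RobotRulesSketch {

    /** One Allow/Disallow line: a path prefix plus whether it grants access. */
    static final class Rule {
        final String prefix;
        final boolean allow;

        Rule(String prefix, boolean allow) {
            this.prefix = prefix;
            this.allow = allow;
        }
    }

    /** First-match policy: the first rule whose prefix matches decides. */
    static boolean isAllowedFirstMatch(List<Rule> rules, String path) {
        for (Rule r : rules) {
            if (path.startsWith(r.prefix)) {
                return r.allow;
            }
        }
        return true; // assumption: no applicable rule means allowed
    }

    /** Longest-match policy: the longest matching prefix decides. */
    static boolean isAllowedLongestMatch(List<Rule> rules, String path) {
        Rule best = null;
        for (Rule r : rules) {
            // Strict '>' keeps the earlier rule when two prefixes have equal length.
            if (path.startsWith(r.prefix)
                    && (best == null || r.prefix.length() > best.prefix.length())) {
                best = r;
            }
        }
        return best == null || best.allow;
    }

    public static void main(String[] args) {
        // Allow: / followed by Disallow: /cgi-bin
        List<Rule> allowFirst = Arrays.asList(new Rule("/", true), new Rule("/cgi-bin", false));
        // Disallow: / followed by Allow: /rss
        List<Rule> disallowFirst = Arrays.asList(new Rule("/", false), new Rule("/rss", true));

        // prints "first=true longest=false"
        System.out.println("first=" + isAllowedFirstMatch(allowFirst, "/cgi-bin/search")
                + " longest=" + isAllowedLongestMatch(allowFirst, "/cgi-bin/search"));
        // prints "first=false longest=true"
        System.out.println("first=" + isAllowedFirstMatch(disallowFirst, "/rss/feed.xml")
                + " longest=" + isAllowedLongestMatch(disallowFirst, "/rss/feed.xml"));
    }
}
```

With first-match, the catch-all "/" rule wins whenever it is listed first, so the more specific line after it is never consulted. With longest-match, the order of the lines does not matter for these two files.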

> RobotRulesParser interprets robots.txt incorrectly
> --------------------------------------------------
>
>          Key: NUTCH-98
>          URL: http://issues.apache.org/jira/browse/NUTCH-98
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7
>     Reporter: Jeff Bowden
>     Priority: Minor
>  Attachments: RobotRulesParser.java.diff
>
> Here's a simple example that the current RobotRulesParser gets wrong:
> User-agent: *
> Disallow: /
> Allow: /rss
> The problem is that the isAllowed function takes the first rule that matches and incorrectly
> decides that URLs starting with "/rss" are Disallowed.  The correct algorithm is to take the
> *longest* rule that matches.  I will attach a patch that fixes this.
