nutch-dev mailing list archives

From "Fuad Efendi (JIRA)" <>
Subject [jira] Created: (NUTCH-101) RobotRulesParser
Date Fri, 30 Sep 2005 05:41:47 GMT

         Key: NUTCH-101
     Project: Nutch
        Type: Bug
  Components: fetcher  
    Versions: 0.7, 0.8-dev    
    Reporter: Fuad Efendi

I noticed this code in protocol-http & protocol-httpclient plugins:

      } else if ( (line.length() >= 6)
                  && (line.substring(0, 6).equalsIgnoreCase("Allow:")) ) {

However, according to the original 1994 protocol description, there is NO "Allow:" field.
To allow everything, simply use "Disallow:" with an empty value.
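
To make the point above concrete, here is a minimal sketch of the 1994 semantics for a single record: only "Disallow:" lines are interpreted, and an empty value disallows nothing. The class name DisallowRules and its methods are hypothetical and written for illustration; this is not the code in the protocol-http or protocol-httpclient plugins.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not Nutch's RobotRulesParser.
class DisallowRules {
    // Path prefixes that the record forbids.
    private final List<String> disallowedPrefixes = new ArrayList<>();

    // Parses the body of ONE record (the lines following its User-agent line).
    static DisallowRules parse(String recordBody) {
        DisallowRules rules = new DisallowRules();
        for (String raw : recordBody.split("\\r?\\n")) {
            String line = raw;
            int hash = line.indexOf('#');          // strip trailing comments
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            if (line.length() >= 9
                    && line.substring(0, 9).equalsIgnoreCase("Disallow:")) {
                String path = line.substring(9).trim();
                // An empty value means "nothing is disallowed", so only
                // non-empty values become rules.
                if (!path.isEmpty()) {
                    rules.disallowedPrefixes.add(path);
                }
            }
            // Any other field, including "Allow:", is ignored in this sketch.
        }
        return rules;
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

With this sketch, DisallowRules.parse("Disallow:  ").isAllowed("/any/page.html") returns true, while a record containing "Disallow: /" blocks every path.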

Please, try to test with
- their site has this:
User-agent: *

And Nutch does not work with New Egg, but it should!

Sorry guys, I don't have enough time to double-check, could you please verify all this...

I noticed strange discussion at, it seems that we need to test

User-agent: ia_archiver
Disallow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: Nutch
Disallow: /

User-agent: TurnitinBot
Disallow: /    

- everything here follows the standard protocol. Could you please retest whether it works with multi-line input like this?
It's a standard!
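
A retest of the multi-record file above could start from something like the following sketch. It assumes records are separated by blank lines, that each record begins with one or more "User-agent:" lines, and that a robot matches a record when its name contains the record's agent token (case-insensitive), with "*" as the fallback. The class MultiRecordRobots is hypothetical and is not the Nutch parser; comment stripping is left out for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch for illustration only; not Nutch's RobotRulesParser.
class MultiRecordRobots {
    // Lower-cased agent token -> disallowed path prefixes, in file order.
    private final Map<String, List<String>> rulesByAgent = new LinkedHashMap<>();

    static MultiRecordRobots parse(String content) {
        MultiRecordRobots robots = new MultiRecordRobots();
        List<String> currentAgents = new ArrayList<>();
        boolean seenRule = false;
        for (String raw : content.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {                  // a blank line ends the record
                currentAgents = new ArrayList<>();
                seenRule = false;
                continue;
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                          // not a "field: value" line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (seenRule) {                    // new record without a blank line
                    currentAgents = new ArrayList<>();
                    seenRule = false;
                }
                if (value.isEmpty()) {
                    continue;
                }
                String agent = value.toLowerCase(Locale.ROOT);
                currentAgents.add(agent);
                robots.rulesByAgent.computeIfAbsent(agent, k -> new ArrayList<>());
            } else if (field.equals("disallow")) {
                seenRule = true;
                if (!value.isEmpty()) {            // empty value disallows nothing
                    for (String agent : currentAgents) {
                        robots.rulesByAgent.get(agent).add(value);
                    }
                }
            }
        }
        return robots;
    }

    boolean isAllowed(String robotName, String path) {
        String name = robotName.toLowerCase(Locale.ROOT);
        List<String> rules = null;
        for (Map.Entry<String, List<String>> e : rulesByAgent.entrySet()) {
            if (!e.getKey().equals("*") && name.contains(e.getKey())) {
                rules = e.getValue();              // first matching record wins
                break;
            }
        }
        if (rules == null) {
            rules = rulesByAgent.get("*");         // fall back to the default record
        }
        if (rules == null) {
            return true;                           // no applicable record
        }
        for (String prefix : rules) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

Fed the four records above as a string literal (not fetched from a live site), this sketch reports that a robot named "Nutch" is blocked from every path, while a robot that none of the records names stays allowed.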

I see this in code:
   StringTokenizer tok = new StringTokenizer(agentNames, ",");
Comma-separated? That is not an accepted standard yet...

Sorry WebExpertsAmerica, I really didn't have any time to run any tests...

Please do not execute tests against production sites.

