nutch-dev mailing list archives

From "Fuad Efendi (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-101) RobotRulesParser
Date Fri, 30 Sep 2005 05:41:47 GMT
RobotRulesParser
----------------

         Key: NUTCH-101
         URL: http://issues.apache.org/jira/browse/NUTCH-101
     Project: Nutch
        Type: Bug
  Components: fetcher  
    Versions: 0.7, 0.8-dev    
    Reporter: Fuad Efendi


I noticed this code in the protocol-http and protocol-httpclient plugins:

      } else if ( (line.length() >= 6)
                  && (line.substring(0, 6).equalsIgnoreCase("Allow:")) ) {


However, according to the original 1994 protocol description, there is NO "Allow:" field.
To allow everything, simply use "Disallow:" with an empty value. See http://www.robotstxt.org/wc/norobots.html

Please try testing with www.newegg.com/robots.txt.
Their site has this:
User-agent: *
Disallow: 

And Nutch does not work with New Egg, but it should!
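
The rule described above (an empty "Disallow:" value blocks nothing) can be sketched as follows. This is only an illustration, not the Nutch parser: the class EmptyDisallowSketch and its methods are hypothetical names, it looks only at the "User-agent: *" record, and it assumes one User-agent line per record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; this is NOT Nutch's RobotRulesParser.
class EmptyDisallowSketch {

    // Collects the Disallow path prefixes of the "User-agent: *" record.
    // An empty Disallow value adds no prefix, so nothing gets blocked.
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardRecord = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw;
            int hash = line.indexOf('#');          // drop trailing comments
            if (hash >= 0) line = line.substring(0, hash);
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;               // blank or malformed line
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("User-agent")) {
                // Assumption: one User-agent line per record.
                inWildcardRecord = value.equals("*");
            } else if (inWildcardRecord && field.equalsIgnoreCase("Disallow")) {
                if (!value.isEmpty()) prefixes.add(value);
            }
        }
        return prefixes;
    }

    // A path is allowed unless it starts with one of the collected prefixes.
    static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedPrefixes(robotsTxt)) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

With the two-line file quoted above, isAllowed returns true for any path; with "Disallow: /" it returns false for any path.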

Sorry guys, I don't have enough time to double-check this myself; could you please verify all of it?

I also noticed a strange discussion at nutch-agent:lucene.apache.org. It seems that we need to test
......./robots.txt

User-agent: ia_archiver
Disallow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: Nutch
Disallow: /

User-agent: TurnitinBot
Disallow: /    


Everything here follows the standard protocol. Can you please retest whether Nutch works with
a multi-line file like this one, with several records? It's a standard!
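
A rough sketch of how a file with several records, like the one above, could be read. It is not the Nutch implementation: MultiRecordSketch is a hypothetical name, agent names are matched by exact case-insensitive comparison (a simplification), and a missing record falls back to "*" or to "allowed".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; this is NOT Nutch's RobotRulesParser.
class MultiRecordSketch {

    // Maps each lower-cased User-agent value to the Disallow prefixes of its record.
    static Map<String, List<String>> parse(String robotsTxt) {
        Map<String, List<String>> rules = new LinkedHashMap<>();
        List<String> current = null;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;               // blank line between records
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("User-agent")) {
                // The whole value is treated as one agent name (no comma splitting).
                current = rules.computeIfAbsent(
                        value.toLowerCase(Locale.ROOT), k -> new ArrayList<>());
            } else if (field.equalsIgnoreCase("Disallow")
                    && current != null && !value.isEmpty()) {
                current.add(value);
            }
        }
        return rules;
    }

    // Uses the agent's own record if present, otherwise the "*" record,
    // otherwise treats the path as allowed.
    static boolean isAllowed(String robotsTxt, String agent, String path) {
        Map<String, List<String>> rules = parse(robotsTxt);
        List<String> prefixes = rules.get(agent.toLowerCase(Locale.ROOT));
        if (prefixes == null) prefixes = rules.get("*");
        if (prefixes == null) return true;
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

For the four records above, this sketch reports "Nutch" as blocked from every path, while an agent that is not listed stays allowed because the file has no "*" record.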

I see this in code:
   StringTokenizer tok = new StringTokenizer(agentNames, ",");
 
Comma-separated? That is not an accepted standard yet...
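
For reference, here is what that tokenizer call does to a string. The wrapper class AgentNameTokens is hypothetical, and where agentNames actually comes from in Nutch (robots.txt or configuration) is not shown in the snippet above, so this only demonstrates the behaviour of java.util.StringTokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical wrapper around the quoted call, for illustration only.
class AgentNameTokens {

    // Splits on commas and trims each piece; a string without commas
    // comes back as a single name.
    static List<String> split(String agentNames) {
        List<String> names = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(agentNames, ",");
        while (tok.hasMoreTokens()) {
            names.add(tok.nextToken().trim());
        }
        return names;
    }
}
```

So a plain value such as "Nutch" yields one name, while "Nutch, OtherBot" yields two.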

Sorry WebExpertsAmerica, I really didn't have any time to run any tests...

Please do not execute tests against production sites.
Thanks!





Mime
View raw message