nutch-dev mailing list archives

From "Otis Gospodnetic (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (NUTCH-101) RobotRulesParser
Date Sat, 20 Jun 2009 04:14:07 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved NUTCH-101.
------------------------------------

    Resolution: Fixed

Thank you, Ken.

> RobotRulesParser
> ----------------
>
>                 Key: NUTCH-101
>                 URL: https://issues.apache.org/jira/browse/NUTCH-101
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.6, 0.7, 0.7.1, 0.8
>            Reporter: Fuad Efendi
>
> I noticed this code in protocol-http & protocol-httpclient plugins:
>       } else if ( (line.length() >= 6)
>                   && (line.substring(0, 6).equalsIgnoreCase("Allow:")) ) {
> However, according to the original 1994 protocol description, there is NO "Allow:" field. To allow access, simply use "Disallow:" with an empty value. http://www.robotstxt.org/wc/norobots.html
> Please, try to test with www.newegg.com/robots.txt
> - their site has this:
> User-agent: *
> Disallow: 
> And Nutch does not work with New Egg, but it should!
> Sorry guys, I don't have enough time to double-check; could you please verify all this?
> I noticed a strange discussion at nutch-agent:lucene.apache.org; it seems that we need to test ......./robots.txt
> User-agent: ia_archiver
> Disallow: /
> User-agent: Googlebot-Image
> Disallow: /
> User-agent: Nutch
> Disallow: /
> User-agent: TurnitinBot
> Disallow: /    
> - everything according to standard protocol. Can you retest please whether it works with multiline? It's a standard!
> I see this in code:
>    StringTokenizer tok = new StringTokenizer(agentNames, ",");
>  
> Comma separated? It's not an accepted standard yet...
> Sorry WebExpertsAmerica, I really didn't have any time to make any test...
> Please do not execute tests against production sites.
> Thanks!
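
For reference, the two behaviours the report asks for (an empty "Disallow:" value allows everything, and each "User-agent:" record is matched on its own) can be sketched roughly as below. This is only an illustrative sketch under stated assumptions, not the Nutch RobotRulesParser and not the fix that was committed: the class name RobotsSketch and the method isAllowed are hypothetical, agent names are matched by case-insensitive substring, and a new record is assumed to start at a blank line or at a "User-agent:" line that follows a "Disallow:" line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of 1994-style robots.txt matching; not Nutch code. */
class RobotsSketch {

    /** Returns true if 'agent' may fetch 'path' under the given robots.txt text. */
    static boolean isAllowed(String robotsTxt, String agent, String path) {
        String wanted = agent.toLowerCase(Locale.ROOT);
        List<String> specific = null;  // prefixes from records naming this agent
        List<String> wildcard = null;  // prefixes from "User-agent: *" records
        boolean collectingAgents = false;
        boolean matchesAgent = false;
        boolean matchesStar = false;

        for (String raw : robotsTxt.split("\r?\n")) {
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) {
                if (hash < 0) {
                    // A truly blank line ends the current record.
                    collectingAgents = false;
                    matchesAgent = false;
                    matchesStar = false;
                }
                continue; // comment-only lines are skipped without ending the record
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "field: value" line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                if (!collectingAgents) {
                    // First User-agent line after rules or a blank line: new record.
                    collectingAgents = true;
                    matchesAgent = false;
                    matchesStar = false;
                }
                if (value.equals("*")) {
                    matchesStar = true;
                    if (wildcard == null) wildcard = new ArrayList<>();
                } else if (!value.isEmpty()
                        && wanted.contains(value.toLowerCase(Locale.ROOT))) {
                    matchesAgent = true;
                    if (specific == null) specific = new ArrayList<>();
                }
            } else if (field.equals("disallow")) {
                collectingAgents = false;
                if (value.isEmpty()) {
                    continue; // empty Disallow: nothing is disallowed
                }
                if (matchesAgent) specific.add(value);
                if (matchesStar) wildcard.add(value);
            }
            // Any other field (including "Allow:") is ignored in this sketch.
        }

        // A record naming the agent takes precedence over the "*" record.
        List<String> rules = (specific != null) ? specific : wildcard;
        if (rules == null) {
            return true; // no applicable record: everything is allowed
        }
        for (String prefix : rules) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

With the two-line file quoted above ("User-agent: *" followed by an empty "Disallow:"), isAllowed returns true for any path. With the multi-record file, it returns false for the agent name "Nutch" and true for an agent that no record names.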

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

