nutch-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-101) RobotRulesParser
Date Fri, 19 Jun 2009 21:16:08 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722014#action_12722014 ]

Ken Krugler commented on NUTCH-101:
-----------------------------------

1. Not sure if the reported problem with "Disallow:" was fixed, or never existed, but the
1.0 code base has no issues with this.

2. Not sure if the reported problem with parsing multiple agent names was fixed, or never
existed, but this code:

StringTokenizer tok = new StringTokenizer(agentNames, ",");

is now

StringTokenizer agentTokenizer = new StringTokenizer(agentNames);

This means it now splits on whitespace (space, tab, newline, carriage return, form feed) but
not on ','.

3. The reported problem with:

StringTokenizer lineParser= new StringTokenizer(content, "\n\r");

doesn't exist. StringTokenizer will break on either \n or \r, and if these occur together
(e.g. DOS line endings) then it still works properly, because the empty string it finds between
the \r and the \n isn't returned (StringTokenizer never returns empty tokens). You could add
support for the more esoteric endings described above, but that doesn't seem very important.

4. The minor bug in main() appears to have been fixed. The code is now:

      String[] robotNames= new String[argv.length - 2];

So I think this issue can be closed.
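
Points 2 and 3 above can be checked with a small standalone program. This is only a sketch, not Nutch code: the class name TokenizerCheck, the helper tokens(), and both sample strings are made up for illustration. It relies on documented java.util.StringTokenizer behaviour only (the default delimiter set is " \t\n\r\f", and empty tokens are never returned).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, not part of Nutch.
public class TokenizerCheck {

    // Collect every token. Passing delims == null selects StringTokenizer's
    // default delimiter set: space, tab, newline, carriage return, form feed.
    static List<String> tokens(String input, String delims) {
        StringTokenizer tok = (delims == null)
                ? new StringTokenizer(input)
                : new StringTokenizer(input, delims);
        List<String> out = new ArrayList<>();
        while (tok.hasMoreTokens()) {
            out.add(tok.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Point 2: made-up agent list mixing a comma and a space.
        String agentNames = "Nutch,MyBot OtherBot";
        System.out.println(tokens(agentNames, ","));   // [Nutch, MyBot OtherBot]
        System.out.println(tokens(agentNames, null));  // [Nutch,MyBot, OtherBot]

        // Point 3: made-up robots.txt content with a DOS ending (\r\n),
        // a blank line (\n\n) and a bare \r.
        String content = "User-agent: *\r\nDisallow: /a\n\nUser-agent: Nutch\rDisallow: /";
        // Adjacent delimiters produce no empty tokens, so four lines come back.
        System.out.println(tokens(content, "\n\r"));
        // [User-agent: *, Disallow: /a, User-agent: Nutch, Disallow: /]
    }
}
```

One side effect visible in the last call: because empty tokens are skipped, blank lines in the input are dropped as well, so any logic that depends on blank lines as record separators would not see them with this tokenizer.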

> RobotRulesParser
> ----------------
>
>                 Key: NUTCH-101
>                 URL: https://issues.apache.org/jira/browse/NUTCH-101
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.6, 0.7, 0.7.1, 0.8
>            Reporter: Fuad Efendi
>
> I noticed this code in protocol-http & protocol-httpclient plugins:
>       } else if ( (line.length() >= 6)
>                   && (line.substring(0, 6).equalsIgnoreCase("Allow:")) ) {
> However, according to the original 1994 protocol description, there is NO "Allow:" field.
> To allow, simply use "Disallow:  ". http://www.robotstxt.org/wc/norobots.html
> Please, try to test with www.newegg.com/robots.txt
> - their site has this:
> User-agent: *
> Disallow: 
> And Nutch does not work with New Egg, but it should!
> Sorry guys, I don't have enough time to double-ensure, could you please verify all this...
> I noticed strange discussion at nutch-agent:lucene.apache.org, it seems that we need
> to test ......./robots.txt
> User-agent: ia_archiver
> Disallow: /
> User-agent: Googlebot-Image
> Disallow: /
> User-agent: Nutch
> Disallow: /
> User-agent: TurnitinBot
> Disallow: /    
> - everything according to standard protocol. Can you retest please whether it works with
> multiline? It's a standard!
> I see this in code:
>    StringTokenizer tok = new StringTokenizer(agentNames, ",");
>  
> Comma separated? It's not accepted standard yet...
> Sorry WebExpertsAmerica, I really didn't have any time to make any test...
> Please do not execute tests against production sites.
> Thanks!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

