incubator-droids-dev mailing list archives

From Thorsten Scherler <thors...@apache.org>
Subject Re: NoRobotClient bug? Seems like it doesn't check the to-be-crawled URI against robots.txt properly
Date Sun, 05 Apr 2009 20:01:45 GMT
On Sun, 2009-04-05 at 00:44 +0100, Robin Howlett wrote:
> I was just looking through NoRobotClient and have a concern about whether
> Droids will actually respect robots.txt when force allow is false in most
> scenarios; consider the following robots.txt:

It is easier to debug this with a test class; see the sketch further down.

> 
> User-agent: *
> Disallow: /foo/
> 
> and the starting URI: http://www.example.com/foo/bar.html
> 
> In the code I see - in NoRobotClient.isUrlAllowed() - the following:
> 
> String path = uri.getPath();
> String basepath = baseURI.getPath();

The base URI in our example is http://www.example.com, so basepath is the empty string, not /foo.

> if (path.startsWith(basepath)) {
>  path = path.substring(basepath.length());
>  if (!path.startsWith("/")) {
>    path = "/" + path;
>  }
> }

path is /foo/bar.html, and since basepath is empty the block above leaves it as /foo/bar.html.
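
A quick, self-contained check with java.net.URI (nothing Droids-specific here; the class name BasePathCheck and the example URIs are mine):

import java.net.URI;

public class BasePathCheck {
    public static void main(String[] args) {
        URI baseURI = URI.create("http://www.example.com");
        URI uri = URI.create("http://www.example.com/foo/bar.html");

        String path = uri.getPath();          // "/foo/bar.html"
        String basepath = baseURI.getPath();  // "" -- empty, not "/foo"

        // The quoted rebasing logic: startsWith("") is always true,
        // substring(0) is a no-op, so path is left untouched.
        if (path.startsWith(basepath)) {
            path = path.substring(basepath.length());
            if (!path.startsWith("/")) {
                path = "/" + path;
            }
        }
        System.out.println(path); // prints "/foo/bar.html"
    }
}

The /bar.html result Robin describes would only appear if the baseURI handed to the client carried the /foo path itself, e.g. http://www.example.com/foo/.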

> ...
> 
> Boolean allowed = this.rules != null ? this.rules.isAllowed( path ) : null;
> if (allowed == null) {
>     allowed = this.wildcardRules != null
>         ? this.wildcardRules.isAllowed( path ) : null;
> }
> if (allowed == null) {
>     allowed = Boolean.TRUE;
> }
> 
> The path will always be converted to /bar.html and checked against the
> Rules in rules and wildcardRules, but it won't match anything. However,
> basepath (which will now be /foo) is never checked against the Rules,
> therefore giving an incorrect true result from isUrlAllowed, no?

Hmm, see above; I disagree, but I have not debugged it yet. Will do that now.
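
In the meantime, a minimal sketch of the lookup in question. Only the quoted rebasing block is taken verbatim; the prefix-matching isAllowed helper is a toy stand-in for the Droids Rules class, not its actual implementation:

import java.net.URI;
import java.util.Arrays;
import java.util.List;

public class IsAllowedSketch {

    // Toy stand-in for the Droids Rules class: a path is disallowed
    // when it starts with any Disallow: prefix; null means "no rule matched".
    static Boolean isAllowed(List<String> disallowed, String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return Boolean.FALSE;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        URI baseURI = URI.create("http://www.example.com");
        URI uri = URI.create("http://www.example.com/foo/bar.html");
        // User-agent: * / Disallow: /foo/
        List<String> wildcardRules = Arrays.asList("/foo/");

        // The rebasing logic quoted above.
        String path = uri.getPath();
        String basepath = baseURI.getPath();
        if (path.startsWith(basepath)) {
            path = path.substring(basepath.length());
            if (!path.startsWith("/")) {
                path = "/" + path;
            }
        }

        Boolean allowed = isAllowed(wildcardRules, path);
        if (allowed == null) {
            allowed = Boolean.TRUE;
        }
        // basepath is "", so path stays /foo/bar.html and this prints false,
        // i.e. the URI is correctly disallowed.
        System.out.println("allowed = " + allowed);
    }
}

If the same sketch is run with baseURI = http://www.example.com/foo/, path collapses to /bar.html and the check wrongly prints true, which is exactly the case Robin is worried about.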

salu2

> robin
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source <consulting, training and solutions>

