nutch-dev mailing list archives

From Juergen Specht <juergen_spe...@shakodo.com>
Subject Nutch Parser annoyingly faulty
Date Sat, 26 Feb 2011 00:08:06 GMT
Hi Nutch Team,

Before I permanently block Nutch from all my sites, I had better tell
you why: your URL parser is extremely faulty and causes a lot of
trouble.

Here is an example. Suppose you have a page at:

http://www.somesite/somepage/

and the link in HTML looks like:

<a href=".">This Page</a>

the parser should identify that the "." (dot) refers
to this URL:

http://www.somesite/somepage/

and not to:

http://www.somesite/somepage/.

Every single browser does it correctly, why not Nutch?
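
For reference, this is the behaviour RFC 3986 describes for relative
references (section 5.2.4, removal of dot segments). A minimal sketch
using urljoin from Python's standard library, chosen only for
illustration since Nutch itself is written in Java; the base URL and
href are the ones from the example above:

```python
from urllib.parse import urljoin

# URL of the page that contains the link (from the example above).
base = "http://www.somesite/somepage/"

# The href value exactly as it appears in the HTML.
href = "."

# urljoin resolves the reference against the base and drops the "." segment.
resolved = urljoin(base, href)
print(resolved)  # http://www.somesite/somepage/
```

The result has no trailing dot, which matches what browsers request.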

Why is this important? Many newer sites no longer map URLs onto
directories in the traditional way, but instead encode controllers,
actions, parameters etc. in the URL.

These get split by a separator, often "/" (slash), so a URL with a
trailing dot requests a different resource than the same URL without
the dot. Ignoring the dot in the backend to cope with Nutch's faulty
parser would create at least 2 URLs serving the same content, which
in turn might affect your Google ranking.

Also, Nutch parses "compressed" JavaScript files, which are written
as one long line, then somehow takes part of the code and appends it
to the URL, causing a huge number of 404s on the server side.

For example, you have a URL to a JavaScript file like this:

  http://www.somesite/javascript/foo.js

Nutch parses this and then accesses random (?) new URLs which look like:

http://www.somesite/javascript/someFunction();

etc etc.
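
How Nutch picks link candidates out of JavaScript is not something I
know, so the following is only a hypothetical sketch of the kind of
sanity check a crawler could apply before turning a string found in a
script into a URL. The function name looks_like_link and the rejected
character set are my own assumptions, not anything taken from Nutch:

```python
import re

# Hypothetical rule: characters that are common in code and that this
# sketch treats as a sign that the candidate is not a link.
_CODE_CHARS = re.compile(r"[(){};\s]")


def looks_like_link(candidate: str) -> bool:
    """Return True if the candidate is non-empty and has no code-like characters."""
    return bool(candidate) and _CODE_CHARS.search(candidate) is None


candidates = ["foo.js", "someFunction();", "../images/logo.png"]
kept = [c for c in candidates if looks_like_link(c)]
print(kept)  # ['foo.js', '../images/logo.png']
```

A rule this strict would also drop legitimate URLs that contain ";" or
parentheses, so it trades a few missed links for far fewer bogus
requests.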

Please, please, please fix Nutch!

Thanks,

Juergen
-- 
Shakodo - The road to profitable photography: http://www.shakodo.com/
