nutch-dev mailing list archives

From Juergen Specht <juergen_spe...@shakodo.com>
Subject Re: Nutch Parser annoyingly faulty
Date Fri, 04 Mar 2011 02:07:24 GMT
Thanks Scott!

Since I wrote this email, which I thought had been ignored by the
Nutch developers, my server has been getting bombarded by 2 especially
annoying and unresponsive companies which use Nutch. Both companies
(and Nutch in general) are blocked by my robots.txt file, see:

http://www.shakodo.com/robots.txt

but while they both access this file a couple of times
per day, they ignore it completely.
The company http://www.lijit.com/ called me an "idiot" for
complaining about their faulty configuration and the other
company http://www.comodo.com/ ignored every complaint.

Can you please check whether my robots.txt file has the correct
syntax and whether I reject Nutch in general correctly? If not, can
you please help me fix the syntax so that Nutch-powered crawlers
no longer access our server(s)? If the syntax is in fact
correct, then I must assume that at least these 2 companies
altered the source to actively abuse the robots.txt rules.

Doesn't this violate your license?
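
One way to check the syntax independently of Nutch is to feed the rules to a standard robots.txt parser and ask it what a given user agent may fetch. The following is only a sketch: the sample rules are hypothetical (they are not the actual contents of my robots.txt), the agent names and the example.com URL are assumptions, and Python's parser may match agent names differently than Nutch does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; paste the real file here.
SAMPLE_ROBOTS_TXT = """\
User-agent: Nutch
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "Nutch" and "SomeOtherBot" are assumed agent names.
for agent in ("Nutch", "SomeOtherBot"):
    allowed = is_allowed(SAMPLE_ROBOTS_TXT, agent, "http://www.example.com/photos/")
    print(agent, "allowed" if allowed else "blocked")
```

If a parser like this reports the crawler as blocked and the requests keep coming, the rules themselves are probably not the problem.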

Help is appreciated!

Juergen
-- 
Shakodo - The road to profitable photography: http://www.shakodo.com/


On 3/4/11 10:40 AM, Scott Gonyea wrote:
> Has anyone looked into this?  This is especially a problem when folks
> like Juergen are a customer and, quite rightfully, raise hell.  I
> wasn't aware of this, since Nutch is a software metaphor for a
> firehose.  But what I have noticed is that the URL Parser is really,
> really terrible.  Expletive-worthy.
>
> The problem I am experiencing is the lack of subdomain support.
> Dumping thousands of regexes into a flatfile is a terrible hack.  More
> than that, pushing metadata down through a given site becomes
> unreliable.  If one site links to another, and that site's links are
> crawled, your metadata is now unreliable.
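
To illustrate the subdomain point: instead of one regex per host, a crawl scope could be expressed as a set of registered domains and checked with a suffix match on the hostname. This is only a sketch of that idea, not how Nutch is implemented; the function name `host_in_domains` and the example.com hosts are made up for the example.

```python
from urllib.parse import urlsplit


def host_in_domains(url: str, allowed_domains: set[str]) -> bool:
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    host = host.lower().rstrip(".")
    # Match the domain itself, or any host ending in "." + domain,
    # so "notexample.com" does not match "example.com".
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = {"example.com"}  # hypothetical crawl scope
for url in ("http://blog.example.com/a", "http://notexample.com/"):
    print(url, host_in_domains(url, allowed))
```
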
>
> Etc.  I don't want to come across as whiny, but I just did.  I really
> think Nutch needs to hunker down on tests.  I'm guilty of not caring
> about it myself, but it's because testing Java is pretty painful
> compared to BDD tools like RSpec:
>
> http://www.codecommit.com/blog/java/the-brilliance-of-bdd
>
> Scott
>
> On Fri, Feb 25, 2011 at 4:08 PM, Juergen Specht
> <juergen_specht@shakodo.com>  wrote:
>> Hi Nutch Team,
>>
>> before I permanently reject Nutch from all my sites, I had better tell
>> you why: your URL parser is extremely faulty and creates a lot of
>> trouble.
>>
>> Here is an example. Suppose you have a page at:
>>
>> http://www.somesite/somepage/
>>
>> and the link in HTML looks like:
>>
>> <a href=".">This Page</a>
>>
>> the parser should identify that the "." (dot) refers
>> to this URL:
>>
>> http://www.somesite/somepage/
>>
>> and not to:
>>
>> http://www.somesite/somepage/.
>>
>> Every single browser does it correctly, why not Nutch?
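
For reference, the expected behaviour is the relative-reference resolution described in RFC 3986, where a "." segment is removed rather than appended. A short sketch using Python's standard library (not Nutch's code) shows the difference between proper resolution and naive string concatenation, using the example URL from above:

```python
from urllib.parse import urljoin

base = "http://www.somesite/somepage/"
href = "."

# Standards-based resolution drops the "." segment.
resolved = urljoin(base, href)

# Naive concatenation reproduces the faulty URL described above.
naive = base + href

print(resolved)  # http://www.somesite/somepage/
print(naive)     # http://www.somesite/somepage/.
```
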
>>
>> Why is this important? Many new sites don't use the traditional
>> mapping of directories from the URL model anymore, but instead
>> have controllers, actions, parameters etc. encoded in the URL.
>>
>> They get split by a separator, which often is "/" (slash), so if
>> there is a trailing dot, it requests a different resource than
>> without the dot. Ignoring the dot in the backend to cope with
>> Nutch's faulty parser would create at least 2 URLs serving the
>> same content, which in turn might affect your Google ranking.
>>
>> Also, Nutch parses "compressed" JavaScript files, which are all
>> written in one long line, then somehow takes part of the code and
>> adds it to the URL, creating a huge number of 404s on the server
>> side.
>>
>> For example, you have a URL to a JavaScript file like this:
>>
>> http://www.somesite/javascript/foo.js
>>
>> Nutch parses this and then accesses random (?) new URLs which look like:
>>
>> http://www.somesite/javascript/someFunction();
>>
>> etc etc.
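
I don't know how Nutch picks link candidates out of JavaScript, so the following is only a guess at the kind of filter that would avoid this. It is a deliberately conservative sketch: the function name `looks_like_url` and the rejected character set are assumptions, and it will also reject some legitimate URLs that contain parentheses or semicolons.

```python
import re

# Characters that commonly appear in code fragments but rarely in links.
_CODE_CHARS = re.compile(r"[(){};\s\"']")

# Prefixes that a plausible absolute or relative link would start with.
_URL_PREFIXES = ("http://", "https://", "/", "./", "../")


def looks_like_url(candidate: str) -> bool:
    """Heuristically decide whether a string found in JavaScript is a link."""
    if not candidate or _CODE_CHARS.search(candidate):
        return False
    return candidate.startswith(_URL_PREFIXES)


for text in ("someFunction();", "/javascript/foo.js"):
    print(repr(text), looks_like_url(text))
```
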
>>
>> Please, please, please fix Nutch!
>>
>> Thanks,
>>
>> Juergen
>> --
>> Shakodo - The road to profitable photography: http://www.shakodo.com/
>>

