nutch-dev mailing list archives

From Julien Nioche <>
Subject Re: Nutch Parser annoyingly faulty
Date Fri, 04 Mar 2011 10:09:16 GMT
Hi Jurgen,

> Since I wrote this email - which I thought got ignored by the
> Nutch developers -

Thanks for reporting the problem, Jurgen, and sorry that you felt you were
being ignored. The few active developers Nutch has contribute in their
spare time; the reason you did not get any comments on this is that no
one had an instant answer or the time to investigate in more detail. You
definitely raised an important issue which is worth investigating.

To answer your first email: the JavaScript parser is notoriously noisy and
generates all sorts of monstrosities. It used to be activated by default, but
this will no longer be the case as of the forthcoming 1.3 release.

I have not been able to reproduce the issue with the dot, though. I put this

<a href=".">This Page</a>

on our server:

ran: ./nutch org.apache.nutch.parse.ParserChecker

and got

Outlinks: 1
  outlink: toUrl: anchor: This Page

as expected.

Any particular URL on your site that you had this problem with?

> I am getting bombed on my server by 2 especially
> annoying and unresponsive companies which use Nutch. The companies
> (and Nutch) are both blocked by my robots.txt file, see:

> but while they both access this file a couple of times
> per day, they ignore it completely.
> One company called me an "idiot" when I complained
> about their faulty configuration, and the other
> company ignored every complaint.

By default, Nutch does respect robots.txt, and the community as a whole
encourages server politeness and reasonable use. However, we can't prevent
people from using ridiculous settings (e.g. a high number of threads per host
or a low time gap between calls) or from modifying the code to bypass the
robots checking (see my comment below).
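For reference, the politeness settings mentioned above are controlled through properties in conf/nutch-site.xml. A minimal sketch for Nutch 1.x follows; the property names are the standard ones, but the values are purely illustrative, not recommendations:

```xml
<!-- Illustrative politeness settings for conf/nutch-site.xml (Nutch 1.x).
     Values shown are examples only. -->
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds to wait between successive requests
  to the same server.</description>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>Maximum number of concurrent fetch threads
  per host.</description>
</property>
```

An operator who raises threads per host and lowers the delay can hammer a server while still running unmodified Nutch, which is why blocking at the server side (robots.txt, then IP rules) remains the practical defence.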

> Can you please check if my robots.txt file has the correct
> syntax and if I reject Nutch in general correctly or can you
> please help me to fix the syntax that Nutch powered crawler
> don't access our server(s) anymore?

I have checked your robots.txt and it looks correct. I tried fetching with
the user-agents you specified: Nutch fully respected robots.txt and the
content was not fetched.
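For anyone wanting to sanity-check a robots.txt file themselves, Python's standard urllib.robotparser implements the same exclusion rules. A minimal sketch, using an illustrative rule set that blocks a "Nutch" agent entirely (not Jurgen's actual file):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content: block the "Nutch" agent from everything.
robots_txt = """\
User-agent: Nutch
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant Nutch crawler must not fetch any page...
print(rp.can_fetch("Nutch", "http://example.com/page.html"))      # False
# ...while agents with no matching rule remain allowed.
print(rp.can_fetch("Googlebot", "http://example.com/page.html"))  # True
```

If the parser says a URL is disallowed for your agent string and the crawler fetches it anyway, the crawler is not honouring robots.txt.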

> If the syntax in fact is
> correct, then I must assume that at least these 2 companies
> altered the source to actively abuse the robots.txt rules.

That's indeed a possibility.

> Doesn't this violate your license?

Not as far as I know. The Apache License allows people to modify the code;
most people do that for positive reasons, and unfortunately we can't prevent
people from bypassing the robots check.

> Help is appreciated!

Another option is to check whether the companies you want to block
consistently use the same IP range, and configure your servers to deny
access to those IPs. You could also file a complaint with the company
hosting the crawl; I know that Amazon are pretty responsive with EC2 and
would take measures to make sure their users do the right thing.
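As an illustration of the IP-range approach, here is a sketch in Apache 2.2 .htaccess syntax. The range 192.0.2.0/24 is a documentation-reserved placeholder; the real range would come from your access logs:

```apache
# .htaccess sketch (Apache 2.2 access-control syntax).
# 192.0.2.0/24 is a placeholder for the crawler's actual range.
Order allow,deny
Allow from all
Deny from 192.0.2.0/24
```

Unlike robots.txt, this is enforced by the server itself, so it works even against crawlers that ignore or bypass the robots check.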



Open Source Solutions for Text Engineering
