nutch-dev mailing list archives

From Piotr Kosiorowski
Subject Tutorial
Date Mon, 08 Aug 2005 12:37:59 GMT
Some time ago someone mentioned on the list a problem with the Nutch
tutorial (I cannot find that email now). I checked it today and
he/she was right. If you follow the Nutch Intranet Crawling tutorial,
you will end up with a not very interesting index.
This is because it recommends users to set urlfilter and urls file for domain, but redirects to and all links are rejected by

So I suggest changing it as follows:
urls file will contain:
crawl-urlfilter.txt will contain:
I would also add pdf and png to the list of rejected extensions in
the crawl-urlfilter.txt file so that users are not confused by errors in
the log file. The pdf parsing plugin is disabled in the default configuration.
I can commit these changes for the 0.7 release (that means today) if I get
positive feedback from the other committers.
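
For illustration only, the filtering behaviour discussed above can be sketched
in a few lines of Python. This is not Nutch code. It assumes that rules in the
style of crawl-urlfilter.txt are checked in order, that '+' accepts and '-'
rejects, that the first pattern found in the URL decides, and that a URL
matching no rule is rejected. The hosts example.org and example.net and the
extension list are hypothetical placeholders, not the values from the tutorial.

```python
import re

# Hypothetical rules in the style of crawl-urlfilter.txt.
# '+' accepts, '-' rejects; the first pattern found in the URL decides.
RULES = [
    ("-", r"\.(gif|jpg|png|pdf|zip|exe)$"),            # skip unparsed file types
    ("+", r"^http://([a-z0-9]*\.)*example\.org/"),     # placeholder domain
    ("-", r"."),                                       # reject everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern occurs in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: no matching rule means the URL is rejected


if __name__ == "__main__":
    for url in (
        "http://www.example.org/index.html",   # inside the allowed domain
        "http://www.example.org/manual.pdf",   # rejected by extension
        "http://www.example.net/page.html",    # a redirect target on another host
    ):
        print(url, accepts(url))
```

If the start URL redirects to a host that the '+' rule does not cover, every
link found there falls through to the final '-' rule, which is the kind of
nearly empty index described above.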
