nutch-dev mailing list archives

From Piotr Kosiorowski <pkosiorow...@gmail.com>
Subject Tutorial
Date Mon, 08 Aug 2005 12:37:59 GMT
Hello,
Some time ago someone mentioned on the list a problem with the Nutch
tutorial (I cannot find that email now). I checked it today and
he/she was right. If you follow the Nutch Intranet Crawling tutorial,
you will end up with a not very interesting index.
This is because it tells users to set up the urlfilter and the urls file
for the nutch.org domain, but www.nutch.org redirects to
http://lucene.apache.org/nutch, so all links are rejected by the
urlfilter.

So I suggest changing it as follows:
The urls file will contain: http://lucene.apache.org/nutch
crawl-urlfilter.txt will contain:
+^http://([a-z0-9]*\.)*apache.org/
I would also add pdf and png to the list of rejected extensions in the
crawl-urlfilter.txt file so that users are not confused by errors in the
log file; the pdf parsing plugin is disabled in the default configuration.
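
To illustrate the effect of the proposed rules, here is a minimal Python sketch. It is not Nutch code. It assumes the filter applies rules in order, that the first matching rule decides, and that a URL matching no rule is rejected. The name accepts, the RULES list, and the shortened extension list are hypothetical and only for illustration; the accept pattern is the one proposed above.

```python
import re

# Hypothetical rule list modeled on the proposal; order matters.
# "-" rejects, "+" accepts. The extension list is abbreviated and
# is an assumption, not the contents of the real file.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|pdf|png)$")),
    ("+", re.compile(r"^http://([a-z0-9]*\.)*apache.org/")),
]


def accepts(url):
    """Return True if the first matching rule is an accept rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    # Assumption: a URL that matches no rule is rejected.
    return False


if __name__ == "__main__":
    for url in (
        "http://lucene.apache.org/nutch",
        "http://www.nutch.org/",
        "http://lucene.apache.org/nutch/logo.png",
    ):
        print(url, accepts(url))
```

Under these assumptions the redirect target is accepted, while the old nutch.org address and any pdf or png link are rejected before they can reach a parser.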
I can commit these changes for the 0.7 release (that means today) if I
get positive feedback from other committers.
Regards
Piotr
