nutch-dev mailing list archives

From Doug Cutting <cutt...@nutch.org>
Subject Re: Tutorial
Date Mon, 08 Aug 2005 17:13:17 GMT
+1

Piotr Kosiorowski wrote:
> Hello,
> Some time ago someone mentioned on the list a problem with the Nutch
> tutorial (I cannot find that email now). I checked it today and
> he/she was right. If you follow the Nutch Intranet Crawling tutorial,
> you will end up with a not very interesting index.
> This is because it tells users to set up the urlfilter and the urls file
> for the nutch.org domain, but www.nutch.org redirects to
> http://lucene.apache.org/nutch, so all links are rejected by the
> urlfilter.
> 
> So I suggest changing it as follows:
> the urls file will contain: http://lucene.apache.org/nutch
> crawl-urlfilter.txt will contain:
> +^http://([a-z0-9]*\.)*apache.org/
> I would also add pdf and png to the list of rejected extensions in the
> crawl-urlfilter.txt file so users are not confused by errors in the
> log file. The pdf parsing plugin is disabled in the default configuration.
> I can commit these changes for the 0.7 release (that means today) if I get
> positive feedback from other committers.
> Regards
> Piotr
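
To illustrate the problem described above, here is a minimal Python sketch of how a list of +/- regex rules could accept or reject URLs. It is not Nutch code. It assumes that rules are tried in order, that the first matching rule decides, that patterns may match anywhere in the URL, and that a URL matching no rule is rejected. The names make_rules and accepts are hypothetical, and the suffix list is shortened for illustration.

```python
import re


def make_rules(domain):
    """Build an ordered rule list in the style of crawl-urlfilter.txt.

    Assumption: '-' rejects, '+' accepts, first match wins.
    The suffix list is a shortened, illustrative stand-in that already
    includes the proposed pdf and png entries.
    """
    return [
        ("-", r"\.(gif|jpg|zip|pdf|png)$"),
        # Same shape as the rule in the message; the dot in the domain is
        # left unescaped, exactly as it was written there.
        ("+", r"^http://([a-z0-9]*\.)*" + domain + "/"),
        ("-", r"."),  # reject everything else
    ]


def accepts(url, rules):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched


if __name__ == "__main__":
    old_rules = make_rules("nutch.org")
    new_rules = make_rules("apache.org")
    for url in ("http://www.nutch.org/", "http://lucene.apache.org/nutch/"):
        print(url, "old:", accepts(url, old_rules), "new:", accepts(url, new_rules))
```

With the nutch.org rule, the redirect target under lucene.apache.org is rejected, which is why the crawl finds almost nothing. With the apache.org rule it is accepted, while pdf links are dropped before any parser sees them.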
