nutch-dev mailing list archives

From Doug Cutting <>
Subject Re: Tutorial
Date Mon, 08 Aug 2005 17:13:17 GMT

Piotr Kosiorowski wrote:
> Hello,
> Some time ago someone mentioned on the list a problem with the nutch
> tutorial (I cannot find this email now). I have checked it today and
> he/she was right.  If you follow the nutch Intranet Crawling tutorial
> you will end up with a not very interesting index.
> This is because it recommends users to set urlfilter and urls file for
> domain, but redirects to
> and all links are rejected by
> urlfilter.
> So I suggest changing it as follows:
> urls file will contain:
> crawl-urlfilter.txt will contain:
> +^http://([a-z0-9]*\.)*
> I would also add pdf and png to the list of rejected extensions in the
> crawl-urlfilter.txt file so users are not confused by errors in the
> log file. The pdf parsing plugin is disabled in the default configuration.
> I can commit such changes for the 0.7 release (that means today) if I get
> positive feedback from other committers.
> Regards
> Piotr
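
The filter behaviour described in the quoted message can be sketched outside Nutch. This is only an illustration, not Nutch code: it assumes rules are tried top to bottom, the first matching pattern decides, and a URL that matches nothing is rejected. The domain example.com is a placeholder, since the real hosts are missing from the message above, and load_rules and accepts are hypothetical helper names.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumption: no matching rule means the URL is rejected


# Narrow filter: a single host only (placeholder domain).
NARROW = load_rules(r"""
+^http://example\.com/
-.
""")

# Broader filter in the spirit of the suggestion: any host under the
# placeholder domain, with pdf and png rejected before the accept rule.
BROAD = load_rules(r"""
-\.(pdf|png)$
+^http://([a-z0-9]*\.)*example\.com/
-.
""")

for url in [
    "http://example.com/",
    "http://www.example.com/index.html",
    "http://www.example.com/manual.pdf",
    "http://other.example.org/",
]:
    print(url, "narrow:", accepts(NARROW, url), "broad:", accepts(BROAD, url))
```

With the narrow rule, a seed URL that redirects to another host of the same site (www.example.com here) is filtered out together with every link on it, which is the kind of empty index the message describes. The broader rule keeps those pages and drops the pdf and png links before they are fetched.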
