nutch-dev mailing list archives

From Christophe Noel
Subject Ignore external links from crawled domains
Date Fri, 05 Aug 2005 08:57:00 GMT

A very basic facility seems to be missing in Nutch. If I have a list of
2000 URLs in the Nutch DB and want to ignore external links, I have to
build a regex filter covering each of the thousands of domains I want to
crawl. There is no parameter to crawl only the domains already in the
list and ignore external links.
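
To make the workaround concrete, here is a minimal sketch of how such a
filter could be generated from the seed list instead of written by hand.
It assumes the usual regex-urlfilter.txt conventions (a leading "+" to
accept, "-" to reject, first matching rule wins) and that URLs are
normalized to include a path. The function name build_filter_rules and
the example hosts are hypothetical, not part of Nutch.

```python
from urllib.parse import urlparse


def build_filter_rules(seed_urls):
    """Build regex-urlfilter style lines from a list of seed URLs.

    Emits one accept ('+') rule per distinct host, followed by a
    catch-all reject rule ('-.') so that every other host is skipped.
    """
    # Collect distinct hosts; urlparse lowercases the hostname for us.
    hosts = set()
    for url in seed_urls:
        host = urlparse(url).hostname
        if host:
            hosts.add(host)

    rules = []
    for host in sorted(hosts):
        # Hostnames contain only letters, digits, hyphens and dots,
        # so the dot is the only character that needs escaping.
        escaped = host.replace(".", "\\.")
        rules.append("+^https?://" + escaped + "/")

    # Assumption: anything not accepted above is an external link.
    rules.append("-.")
    return rules


if __name__ == "__main__":
    seeds = [
        "http://www.example.org/a",
        "http://www.example.org/b",
        "http://other.example.com/",
    ]
    print("\n".join(build_filter_rules(seeds)))
```

This only automates the thousands-of-rules approach; it does not remove
the need for one rule per host, which is the limitation in question.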

Is there currently another solution? Has anybody worked on this?

Thank you very much.

Christophe Noël.
