nutch-dev mailing list archives

From David Stuart <david.stu...@progressivealliance.co.uk>
Subject Re: [jira] Created: (NUTCH-926) Nutch follows wrong url in <META http-equiv="refresh" tag
Date Wed, 27 Oct 2010 14:50:58 GMT
Have you tried restricting the crawl range in regex-urlfilter.txt? Instead of having:

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.

change to:

# Crawl right domain (the pattern is matched against the full URL, scheme included)
+^https?://www\.rightdomain\.com/

# Deny anything else
-.
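
To sanity-check rules like these before a crawl, here is a rough Python sketch of how I understand the filter to evaluate them: rules are tried top to bottom, the first pattern found anywhere in the URL decides, and a URL that matches nothing is dropped. This is only an emulation under those assumptions, not Nutch's own code. The names RULES and url_allowed are made up for illustration, and the accept pattern assumes the rule is written against the full URL including the scheme.

```python
import re

# Hypothetical mirror of the suggested regex-urlfilter.txt rules.
# Each entry is (accept, pattern): True for a "+" rule, False for a "-" rule.
RULES = [
    (True, re.compile(r"^https?://www\.rightdomain\.com/")),  # crawl right domain
    (False, re.compile(r".")),                                # deny anything else
]


def url_allowed(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is an accept rule."""
    for accept, pattern in rules:
        # search() looks for the pattern anywhere in the URL (assumed semantics).
        if pattern.search(url):
            return accept
    # Assumption: a URL that matches no rule at all is rejected.
    return False


if __name__ == "__main__":
    for candidate in (
        "http://www.rightdomain.com/index.html",
        "http://www.wrongdomain.com/index.html",
    ):
        print(candidate, "->", "keep" if url_allowed(candidate) else "skip")
```

Because the accept rule comes first, the catch-all deny rule only ever sees URLs outside the right domain, so the order of the two rules matters.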



David


On 27 Oct 2010, at 15:46, Marco Novo (JIRA) wrote:

>  If that we will spider all the web....

