nutch-dev mailing list archives

From: Doug Cutting
Subject: Re: Website mirroring
Date: Wed, 23 Mar 2005 17:32:59 GMT

Kelvin Tan wrote:
> I've been pondering the appropriateness of Nutch for website mirroring (and subsequent
> searching), basically Teleport Pro-like functionality.
> I've already patched Nutch to do this by including hard-coded rules, such as only adding links
> from a page if they are within the same domain. The current URL filtering mechanism can be extended
> to provide support for more flexible URL filtering (like domain-only, host-only), but this
> doesn't belong in a whole-web crawling application.

Nutch is certainly not meant to be only a whole-web crawling
application.  A URL filter can be a plugin, so why not submit your patch
as a plugin that's disabled by default?  Then folks who want the
functionality you describe can simply specify your URL filter plugin.
You can supply sample config files and documentation with the plugin.
Does that sound workable?
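
For illustration only, a host-only filter of the kind discussed above might look roughly like the sketch below. The UrlFilter interface here is a hypothetical stand-in declared locally so the example compiles on its own. It assumes a filter that returns the URL to keep it and null to drop it. The class name SameHostUrlFilter and its constructor argument are likewise made up, and the plugin descriptor and configuration wiring that a real Nutch plugin would need are not shown.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical stand-in for a URL filter plugin interface (an assumption,
// not the actual Nutch API): return the URL to keep it, or null to drop it.
interface UrlFilter {
    String filter(String url);
}

// Host-only filter: accepts a URL only if its host matches the configured host.
class SameHostUrlFilter implements UrlFilter {
    private final String allowedHost;

    SameHostUrlFilter(String allowedHost) {
        // Host names are case-insensitive, so normalize once up front.
        this.allowedHost = allowedHost.toLowerCase(Locale.ROOT);
    }

    @Override
    public String filter(String url) {
        if (url == null) {
            return null;
        }
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                // No host component (e.g. mailto: links): reject.
                return null;
            }
            return host.toLowerCase(Locale.ROOT).equals(allowedHost) ? url : null;
        } catch (URISyntaxException e) {
            // Malformed URLs are rejected rather than passed on to the crawler.
            return null;
        }
    }
}
```

A mirroring setup would construct the filter with the host being mirrored, for example new SameHostUrlFilter("example.org"), and discard any outlink for which filter() returns null. A domain-only variant could compare a host suffix instead of the whole host.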

