nutch-dev mailing list archives

From Kelvin Tan <kelvin-li...@relevanz.com>
Subject Website mirroring
Date Wed, 23 Mar 2005 15:12:28 GMT
I've been pondering the appropriateness of Nutch for website mirroring (and subsequent searching),
basically Teleport Pro-like functionality. 

I've already patched Nutch to do this by including hard-coded rules, like only adding links from
a page if they're within the same domain. The current URL filtering mechanism can be extended
to provide support for more flexible URL filtering (like domain-only, host-only), but this
doesn't belong in a whole-web crawling application.
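
To make the domain-only / host-only idea concrete, here is a rough sketch of what such a scope rule might look like. It is a standalone illustration, not the patch mentioned above and not Nutch's filtering API: the class name MirrorScopeFilter, the Scope enum, and the helper registeredDomainGuess are all hypothetical, and the "domain" is approximated as the last two host labels, which is wrong for suffixes like co.uk.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical sketch: accepts a link only if it stays within the seed's host or domain.
public class MirrorScopeFilter {
    public enum Scope { HOST_ONLY, DOMAIN_ONLY }

    private final String seedHost;
    private final Scope scope;

    public MirrorScopeFilter(String seedUrl, Scope scope) {
        // URI.create throws IllegalArgumentException on a malformed seed.
        String host = URI.create(seedUrl).getHost();
        if (host == null) {
            throw new IllegalArgumentException("Seed URL has no host: " + seedUrl);
        }
        this.seedHost = host.toLowerCase(Locale.ROOT);
        this.scope = scope;
    }

    /** Returns the URL unchanged if it is in scope, or null to reject it. */
    public String filter(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null; // unparseable links are simply dropped
        }
        if (host == null) {
            return null;
        }
        host = host.toLowerCase(Locale.ROOT);
        if (scope == Scope.HOST_ONLY) {
            return host.equals(seedHost) ? url : null;
        }
        // DOMAIN_ONLY: accept the domain itself and any subdomain of it.
        String domain = registeredDomainGuess(seedHost);
        boolean inDomain = host.equals(domain) || host.endsWith("." + domain);
        return inDomain ? url : null;
    }

    // Assumption: the last two labels form the domain (naive, ignores public suffixes).
    static String registeredDomainGuess(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }
}
```

With a seed of http://www.example.com/, HOST_ONLY would keep links on www.example.com and drop docs.example.com, while DOMAIN_ONLY would keep both and drop anything outside example.com.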

I guess I'm looking for a hybrid between the Heritrix crawler (http://crawler.archive.org/)
and Nutch for ~2 million pages. I can always build a Lucene search interface on top of Heritrix
instead of using Nutch, but the ARC format doesn't seem amenable to runtime "cached page"
retrieval, and more importantly, it would probably take much more time than I have to add
all the auxiliary stuff and infrastructure that Nutch already provides.

Any thoughts? 

kelvin

