nutch-dev mailing list archives

From Apache Wiki <>
Subject [Nutch Wiki] Update of "FAQ" by GodmarBack
Date Wed, 06 Jan 2010 23:53:48 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "FAQ" page has been changed by GodmarBack.
The comment on this change is: added useful link to Crawling the local filesystem page.


  Now you can invoke the crawler and index all or part of your disk. The only remaining gotcha
is that Mozilla will '''not''' load file: URLs from a web page fetched over http. If you test
with the Nutch web container running in Tomcat, clicking on a result will therefore do nothing.
This is mentioned [[|here]], and the behavior may be disabled by a [[|preference]]
(see security.checkloaduri). IE5 does not have this problem.
- ==== Nutch crawling parent directories for file protocol ->  misconfigured URLFilters
+ ==== Nutch crawling parent directories for file protocol ====
+ If you find Nutch crawling parent directories when using the file protocol, the following
kludge may help:
- [[]] E.g. for urlfilter-regex you should put
the following in regex-urlfilter.txt :
+ [[]] For example, for urlfilter-regex you could put
the following in regex-urlfilter.txt:
+ Alternatively, you could apply the patch described [[|on
this page]], which avoids hardwiring the site-specific /top/directory in your
configuration file.
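The rules that belong in regex-urlfilter.txt are not included in this message, so the following is only a rough sketch of how such a filter could keep a crawl below /top/directory. It assumes, as an understanding of the urlfilter-regex plugin rather than a quotation from the FAQ, that each rule is a `+` (accept) or `-` (reject) followed by a regular expression, and that the first rule whose pattern is found in the URL decides. The rule text and the helper names `parse_rules` and `accepts` are hypothetical.

```python
import re

# Hypothetical rules in the style of regex-urlfilter.txt (an assumption, not
# the rules from the FAQ): '+' accepts, '-' rejects, first matching rule wins.
RULES = """
# accept anything below the site-specific top directory (placeholder path)
+^file:///top/directory/
# reject everything else, including parent directories
-.
"""

def parse_rules(text):
    """Turn rule lines into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(url, rules):
    """Return the verdict of the first rule whose pattern is found in url."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: treat the URL as rejected

if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("file:///top/directory/docs/a.txt", "file:///top/", "file:///"):
        print(url, accepts(url, rules))
```

Under these assumptions, URLs below /top/directory are accepted while the parent directories fall through to the final reject rule, which is the behavior the kludge is after.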
  ==== How do I index remote file shares? ====
