nutch-dev mailing list archives

From Apache Wiki <>
Subject [Nutch Wiki] Update of "FAQ" by GodmarBack
Date Thu, 07 Jan 2010 00:45:44 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "FAQ" page has been changed by GodmarBack.


-       <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
+       <value>protocol-file|...copy original values from nutch-default here...</value>
+ where you should copy and paste all values from the plugin.includes setting in
nutch-default.xml. This ensures that every plug-in that is normally enabled stays enabled,
in addition to the protocol-file plugin. Make sure to include parse-pdf if you want to parse PDF files.
Make sure that urlfilter-regexp is included, or else '''the *urlfilter files will be ignored''',
leading Nutch to accept all URLs. You need to enable crawl URL filters to prevent Nutch from
crawling up into the parent directory; see below.
  Now you can invoke the crawler and index all or part of your disk. The only remaining gotcha
is that Mozilla will '''not''' load file: URLs from a web page fetched over
http. If you test with the Nutch web container running in Tomcat, annoyingly, nothing will
happen when you click on results. This is mentioned
[[|here]], and this behavior
may be disabled by a [[|preference]]
(see security.checkloaduri). IE5 does not have this problem.
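
A quick way to sanity-check a plugin.includes value before crawling is to test it against the plug-in ids you expect to be active. The sketch below is not part of Nutch; it assumes that Nutch treats the value as a regular expression that must match a whole plug-in id, and it uses Python's re module as a stand-in for Java's regex engine (the two should agree for simple alternations like this one). The helper names `is_enabled` and `missing_plugins` are hypothetical. The sample value is the old setting that this change removes.

```python
import re

# The old plugin.includes value that this change removes.
OLD_VALUE = ("protocol-file|protocol-http|parse-(text|html)|"
             "index-basic|query-(basic|site|url)")


def is_enabled(includes, plugin_id):
    """Return True if the whole plugin id matches the includes regex.

    Assumption: Nutch requires a full match of the plug-in id.
    """
    return re.fullmatch(includes, plugin_id) is not None


def missing_plugins(includes, required):
    """Return the required plug-in ids that the includes regex does not enable."""
    return [p for p in required if not is_enabled(includes, p)]


if __name__ == "__main__":
    required = ["protocol-file", "parse-pdf", "urlfilter-regexp"]
    # Report which of the plug-ins named in the FAQ text are not enabled.
    print(missing_plugins(OLD_VALUE, required))
```

With the old value, the check reports parse-pdf and urlfilter-regexp as missing, which is the situation the text warns about: without urlfilter-regexp the *urlfilter files are ignored.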
