nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "FAQ" by SebastianNagel
Date Mon, 12 Jun 2017 21:36:24 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "FAQ" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/FAQ?action=diff&rev1=139&rev2=140

Comment:
Update regarding parent directories, single slash after file:/, cf. NUTCH-1483

        <value>protocol-file|...copy original values from nutch-default here...</value>
      </property>
  }}}
- where you should copy and paste all values from nutch-default.xml in the plugin.includes
setting provided there. This will ensure that all plug-ins normally enabled will be enabled,
plus the protocol-file plugin. Make sure that urlfilter-regex is included, or else '''the
urlfilter files will be ignored''', leading Nutch to accept all URLs. You need to enable crawl
URL filters to prevent Nutch from crawling up the parent directory, see below.
+ where you should copy and paste all values from nutch-default.xml in the plugin.includes
setting provided there. This will ensure that all plug-ins normally enabled will be enabled,
plus the protocol-file plugin.
  
  Now you can invoke the crawler and index all or part of your disk.
  
  ==== Nutch crawling parent directories for file protocol ====
- If you find Nutch crawling parent directories when using the file protocol, the following
Jira issue may help:
  
- http://issues.apache.org/jira/browse/NUTCH-407 E.g. for urlfilter-regex you could put the
following in regex-urlfilter.txt :
+ By default, Nutch will step into parent directories. You can avoid this by setting the following
property to false:
  
  {{{
+ <property>
+   <name>file.crawl.parent</name>
+   <value>false</value>
+   <description>The crawler is not restricted to the directories that you specified in the
+     Urls file but it is jumping into the parent directories as well. For your own crawlings you can
+     change this behavior (set to false) the way that only directories beneath the directories that you specify get
+     crawled.</description>
+ </property>
+ }}}
+ 
+ Alternatively, you could add a regex URL filter rule, e.g.
+ {{{
- +^file:///c:/top/directory/
+ +^file:/c:/top/directory/
  -.
  }}}
- Alternatively, you could apply the patch described [[http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch|on
this page]], which would avoid the hard-wiring of the site-specific /top/directory in your
configuration file.
+ Make sure that the urlfilter-regex plugin is enabled in plugin.includes, or else the URL filter rules will be ignored.
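
The effect of the two rules above can be illustrated with a small stand-alone sketch. It is a simplified model and not Nutch's own filter code: the class name `RegexUrlFilterSketch` and its methods are hypothetical, and it assumes that rules are tried in order, that the first pattern found in the URL decides, and that `+` means accept while `-` means reject.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Simplified model of ordered +/- regex rules (hypothetical, not Nutch's implementation).
class RegexUrlFilterSketch {

    // One rule line: a leading '+' or '-' followed by a regular expression.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    RegexUrlFilterSketch(String... lines) {
        for (String line : lines) {
            rules.add(new Rule(line));
        }
    }

    // Assumption: the first rule whose pattern is found in the URL decides;
    // a URL that matches no rule at all is rejected.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RegexUrlFilterSketch filter =
            new RegexUrlFilterSketch("+^file:/c:/top/directory/", "-.");
        // Inside the allowed directory tree: accepted by the first rule.
        System.out.println(filter.accepts("file:/c:/top/directory/docs/a.html"));
        // Parent directory: falls through to the catch-all "-." rule.
        System.out.println(filter.accepts("file:/c:/top/"));
        // Three slashes do not match a rule written with a single slash.
        System.out.println(filter.accepts("file:///c:/top/directory/a.html"));
    }
}
```

Under these assumptions, the catch-all `-.` rule is what keeps the crawler out of parent directories, and a rule written with one slash does not cover URLs that still carry three.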
+ 
+ ==== A note on slashes after file: ====
+ 
+ When a {{{file:}}} URL is parsed by the Java URL class and converted back to a string, only one slash remains:
+ {{{
+ String url = "file:///path/index.html";
+ java.net.URL u = new java.net.URL(url);
+ url = u.toString();  // url is now file:/path/index.html
+ }}}
+ Because such conversions are quite frequent, it is better to write URLs (and also URL filter
rules, etc.) with a single slash ({{{file:/path/index.html}}}). Nutch's URL normalizers in
the default configuration also normalize file: URLs to have only one slash.
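
The slash behaviour can be checked outside of Nutch with a minimal, self-contained sketch. The class name `FileUrlSlashDemo` and the helper `roundTrip` are hypothetical, and the sketch uses only the standard `java.net` classes. It goes through `URI.create(...).toURL()` rather than the `URL` constructor, an assumption made only to avoid deprecation warnings on recent JDKs.

```java
import java.net.MalformedURLException;
import java.net.URI;

// Hypothetical demo class: parses a URL string and converts it back to a string.
class FileUrlSlashDemo {

    // Round-trips a URL string through java.net.URL.
    static String roundTrip(String url) {
        try {
            return URI.create(url).toURL().toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        // The empty authority between "//" and the path is dropped on output.
        System.out.println(roundTrip("file:///path/index.html")); // file:/path/index.html
        // A URL already written with one slash is left unchanged.
        System.out.println(roundTrip("file:/path/index.html"));   // file:/path/index.html
    }
}
```

Writing seed URLs and filter rules in the single-slash form means they already match what such a round trip produces.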
  
  ==== How do I index remote file shares? ====
  Currently, Nutch does not have built-in support for accessing files over SMB (Windows)
shares.  This means the only available method is to mount the shares yourself, then index
the contents as though they were local directories (see above).
