nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Trivial Update of "FAQ" by LewisJohnMcgibbney
Date Thu, 07 Feb 2013 02:49:59 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "FAQ" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=135&rev2=136

Comment:
a

  URLs that are already in the database won't be injected.
  
  === Fetching ===
+ 
+ ==== Can I parse during the fetching process? ====
+ In short, yes; however, this is disabled by default (the reason is given below). To enable it, set the following property in nutch-site.xml before starting the fetch process.
+ {{{
+ <property>
+   <name>fetcher.parse</name>
+   <value>true</value>
+   <description>If true, fetcher will parse content. Default is false, which means
+   that a separate parsing step is required after fetching is finished.</description>
+ </property>
+ }}} 
+ 
+ '''N.B.''' In a parsing fetcher, outlinks are processed in the mapper (at least when outlinks are followed). If a fetcher's reducer stalls, you may run out of memory or disk space, usually after a very long reduce job. Behaviour similar to that described in [[http://www.mail-archive.com/user@nutch.apache.org/msg05031.html|this message]] is typically observed in this situation.
+ 
+ In summary, users are advised '''not''' to use a parsing fetcher where possible, as it is heavy on I/O and often leads to the above outcome.
+  
  ==== Is it possible to fetch only pages from some specific domains? ====
  Please have a look at PrefixURLFilter. Adding some regular expressions to the regex-urlfilter.txt file might work, but a list of thousands of regular expressions would slow down your system excessively.
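  
  As a rough sketch of the PrefixURLFilter approach, assuming the urlfilter-prefix plugin is available in your build and reads its default prefix-urlfilter.txt file from the conf directory, the plugin could be enabled through plugin.includes in nutch-site.xml. The plugin list below is illustrative and abbreviated, not a recommended or complete value; adapt it to the list already used in your installation.
  {{{
  <property>
    <name>plugin.includes</name>
    <!-- Assumption: illustrative plugin list; only urlfilter-prefix matters here -->
    <value>protocol-http|urlfilter-prefix|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  }}}
  The prefix file would then list one accepted URL prefix per line. The example.com and example.org hosts are hypothetical placeholders.
  {{{
  # prefix-urlfilter.txt: only URLs starting with one of these prefixes are kept
  http://www.example.com/
  http://example.org/
  }}}
  Because each URL is matched against literal prefixes rather than evaluated against every regular expression in turn, a long list of domains should cost far less here than in regex-urlfilter.txt.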
  
