nutch-dev mailing list archives

From Apache Wiki <>
Subject [Nutch Wiki] Trivial Update of "FAQ" by LewisJohnMcgibbney
Date Thu, 07 Feb 2013 02:49:59 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "FAQ" page has been changed by LewisJohnMcgibbney:


  URLs that are already in the database won't be injected.
  === Fetching ===
+ ==== Can I parse during the fetching process? ====
+ In short, yes; however, this is disabled by default (justification follows shortly). To enable it, configure the following in nutch-site.xml before initiating the fetch process.
+ {{{
+ <property>
+   <name>fetcher.parse</name>
+   <value>true</value>
+   <description>If true, fetcher will parse content. Default is false, which means
+   that a separate parsing step is required after fetching is finished.</description>
+ </property>
+ }}} 
+ '''N.B.''' In a parsing fetcher, outlinks are processed in the reduce phase (at least when outlinks are followed). If a fetcher's reducer stalls, you may run out of memory or disk space, usually after a very long reduce job. Behaviour typical of [[|this]] is usually observed in this situation.
+ In summary, if it is possible, users are advised '''not''' to use a parsing fetcher as it
is heavy on IO and often leads to the above outcome.
  ==== Is it possible to fetch only pages from some specific domains? ====
  Please have a look at PrefixURLFilter. Adding some regular expressions to the regex-urlfilter.txt file might work, but a list with thousands of regular expressions would slow down your system excessively.
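  As a rough sketch of what switching to PrefixURLFilter could look like in nutch-site.xml: the plugin id urlfilter-prefix and the property urlfilter.prefix.file are given here as recalled from the plugin's defaults, and the plugin.includes value is only illustrative, so check both against the nutch-default.xml shipped with your version. The domains in the comment are hypothetical.
  {{{
  <!-- Illustrative only: start from the plugin.includes value in your own
       nutch-default.xml and add urlfilter-prefix to it. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-prefix|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <!-- Assumed property name; the file is expected in the conf directory and
       lists one accepted URL prefix per line, for example (hypothetical):
         http://www.example.org/
         http://docs.example.com/ -->
  <property>
    <name>urlfilter.prefix.file</name>
    <value>prefix-urlfilter.txt</value>
  </property>
  }}}
  With a setup along these lines, a URL is accepted only if it starts with one of the listed prefixes, which avoids evaluating a long list of regular expressions for every URL.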
