nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Trivial Update of "bin/nutch_hostinject" by LewisJohnMcgibbney
Date Fri, 11 Jan 2013 01:59:46 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "bin/nutch_hostinject" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_hostinject

New page:
hostinject is an alias for org.apache.nutch.crawl.HostInjectorJob

This class takes a flat file of hosts and adds them to the list of seeds to be crawled. It is
useful for bootstrapping the system. The hosts file contains one host per line, optionally
followed by custom metadata. Metadata entries are separated by tabs, and each key is separated
from its value by '='. '''N.B. Is the metadata functionality supported yet?'''.

Note that some metadata keys are reserved: 

''nutch.score'': sets a custom score for a specific URL

''nutch.fetchInterval'': sets a custom fetch interval for a specific URL

e.g. (fields separated by tabs) http://www.xyz.org/ nutch.score=10 nutch.fetchInterval=2592000 userType=open_source
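
The line format described above can be illustrated with a minimal Python sketch. This is not how HostInjectorJob itself reads the file; the function name `parse_host_line` is hypothetical. It assumes fields are tab-separated and that each metadata field has the form key=value, as stated above.

```python
def parse_host_line(line):
    """Split one seed line into (host, metadata dict).

    Hypothetical helper for illustration only; it mirrors the format
    described on this page, not the actual Nutch implementation.
    """
    fields = line.rstrip("\n").split("\t")
    host = fields[0].strip()
    metadata = {}
    for field in fields[1:]:
        if not field.strip():
            continue  # tolerate stray or trailing tabs
        key, sep, value = field.partition("=")
        if not sep:
            raise ValueError(f"metadata field without '=': {field!r}")
        metadata[key.strip()] = value.strip()
    return host, metadata


# The example line from this page, with tabs between the fields.
line = ("http://www.xyz.org/\tnutch.score=10"
        "\tnutch.fetchInterval=2592000\tuserType=open_source")
host, metadata = parse_host_line(line)
print(host)
print(metadata)
```

Values are kept as strings here; whether and how Nutch interprets them is outside the scope of this sketch.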

Usage: 
{{{
bin/nutch hostinject <host_dir>
}}}

'''<host_dir>''': The directory containing the seed list (the 'flat file' referred to
above), usually a text file with one host per line.
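
As a rough sketch of preparing such a directory, the following Python snippet writes a seed file with one host per line and tab-separated key=value metadata. The helper name `write_seed_dir`, the file name `seed.txt` and the example hosts are assumptions for illustration; the page does not prescribe a file name.

```python
import tempfile
from pathlib import Path


def write_seed_dir(host_dir, entries, filename="seed.txt"):
    """Write (host, metadata) pairs to host_dir/filename, one host per line.

    Hypothetical helper; the file name is an assumption, not a Nutch rule.
    """
    host_dir = Path(host_dir)
    host_dir.mkdir(parents=True, exist_ok=True)
    lines = []
    for host, metadata in entries:
        # Host first, then key=value fields, all joined by tabs.
        fields = [host] + [f"{key}={value}" for key, value in metadata.items()]
        lines.append("\t".join(fields))
    seed_file = host_dir / filename
    seed_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return seed_file


# Write into a temporary location so the sketch leaves no files behind in cwd.
host_dir = Path(tempfile.mkdtemp()) / "hosts"
seed_file = write_seed_dir(
    host_dir,
    [
        ("http://www.xyz.org/", {"nutch.score": "10"}),
        ("http://www.example.org/", {}),
    ],
)
print(seed_file.read_text(encoding="utf-8"), end="")
```

The resulting directory would then be passed as <host_dir> to the command shown under Usage.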


CommandLineOptions
