nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1325) HostDB for Nutch
Date Thu, 21 Jan 2016 13:34:39 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1325:
---------------------------------
     Patch Info: Patch Available
    Description: 
h1. HostDB for Apache Nutch 1.x

* automatically generates a HostDB based on CrawlDB information
* periodically performs DNS lookups for all hosts and keeps track of DNS failures
* discovers the homepage if www.example.org/ is a redirect
* keeps track of host statistics such as the number of URLs, 404s, not-modified responses and redirects
* aggregates CrawlDB metadata fields into totals, sums, min, max, average and configurable percentiles
* can output lists of discovered homepage URLs for seed lists and static fetch interval
* can output blacklists of hosts that have too many DNS failures, so they can be filtered from the CrawlDB using domainblacklist-urlfilter
* supports JEXL expressions, just like CrawlDB
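
To illustrate what aggregating a numeric metadata field into a sum, min, max, average and percentile means, here is a minimal shell sketch. It is not the HostDB implementation: the function name summarise, the sample values and the nearest-rank percentile method are all assumptions made for this example only.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch: reads one numeric value per line
# on stdin and prints count, sum, min, max, average and a nearest-rank
# percentile (default: 90th).
summarise() {
  pct="${1:-90}"
  sort -n | awk -v pct="$pct" '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) exit 1
      # Nearest-rank percentile: rank = ceil(pct / 100 * N), at least 1.
      rank = int((pct / 100) * NR)
      if (rank < (pct / 100) * NR) rank++
      if (rank < 1) rank = 1
      printf "count=%d sum=%g min=%g max=%g avg=%g p%d=%g\n", NR, sum, v[1], v[NR], sum / NR, pct, v[rank]
    }'
}

# Made-up response times in milliseconds for a single host.
printf '%s\n' 40 55 70 120 35 | summarise 90
# prints: count=5 sum=320 min=35 max=120 avg=64 p90=120
```

How the real tool computes its percentiles, and which percentiles it reports, is controlled by configuration and may differ from this sketch.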

h4. Examples

Generate for the first time, or update an existing HostDB:
{code}
bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb
{code}

Optional filtering or normalizing:
{code}
bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb -filter -normalize
{code}

Dumping as CSV file:
{code}
bin/nutch readhostdb crawl/hostdb output_directory
{code}

Get only hostnames that have an average response time above 50ms:
{code}
bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(avg._rs_ > 50)"
{code}

Get only hosts that have over 50% 404s:
{code}
bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(gone / numRecords > 0.5)"
{code}

For JEXL expressions, all host metadata fields are available. All other fields are also available under these names:

unfetched -- number of unfetched records
fetched -- number of fetched records
gone -- number of 404s
redirTemp -- number of temporary redirects
redirPerm -- number of permanent redirects
redirs -- total number of redirects (redirTemp + redirPerm)
notModified -- number of not modified records
ok -- number of usable pages (fetched + notModified)
numRecords -- total number of records
dnsFailures -- number of DNS failures
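
As a worked illustration of how the derived fields relate to the raw counts, and of the arithmetic behind the "(gone / numRecords > 0.5)" example, here is a minimal shell sketch. The counts are made up, the function name over_half_gone is hypothetical, and the script neither reads a real HostDB nor evaluates JEXL; it only mirrors the arithmetic.

```shell
#!/bin/sh
# Made-up counts for a single host; real values come from the HostDB.
fetched=120
notModified=30
redirTemp=5
redirPerm=15
gone=200
numRecords=370

# Derived fields, following the definitions in the list above.
redirs=$((redirTemp + redirPerm))   # total number of redirects
ok=$((fetched + notModified))       # number of usable pages

# Hypothetical helper: succeeds when gone / numRecords exceeds 0.5.
# awk is used because POSIX shell arithmetic is integer-only.
over_half_gone() {
  awk -v gone="$1" -v total="$2" 'BEGIN { exit !(total > 0 && gone / total > 0.5) }'
}

echo "redirs=$redirs ok=$ok"
if over_half_gone "$gone" "$numRecords"; then
  echo "more than 50% of records are gone"
else
  echo "50% or fewer of records are gone"
fi
```

With these sample counts the ratio is 200 / 370, roughly 0.54, so the host would match the expression.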

See also nutch-default.xml for the hostdb.* properties.

  was:
A HostDB for Nutch and associated tools to create and read a database containing information on hosts.



> HostDB for Nutch
> ----------------
>
>                 Key: NUTCH-1325
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1325
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-removed-from-1.8.patch, NUTCH-1325-trunk-v3.patch, NUTCH-1325-trunk-v4.patch, NUTCH-1325-trunk-v5.patch, NUTCH-1325-v4-v5.patch, NUTCH-1325.patch, NUTCH-1325.patch, NUTCH-1325.patch, NUTCH-1325.trunk.v2.path, oi-hostdb.patch, oi-hostdb.patch, oi-hostdb.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
