nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1325) HostDB for Nutch
Date Thu, 21 Jan 2016 15:09:39 GMT


Markus Jelsma commented on NUTCH-1325:

Yes, they are very useful for finding websites that, for example, score positively overall
on custom text or structure classifiers, such as "give me all hosts that generally talk about
music, politics or illicit topics". Also, the dumping can generate a wide variety of blacklists,
e.g. for not crawling (generating) certain hosts, not indexing them, or removing them completely.
Of course, if you erase hosts from your CrawlDB, you must keep the blacklist around, or they
will come back at some point :)
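As a concrete sketch of the blacklist idea: one way to build such a list is to post-process a readhostdb CSV dump with a small awk filter. The column layout, file names and the DNS-failure threshold below are assumptions for illustration, not the actual dump format.

```shell
# Hypothetical sketch: derive a host blacklist from a readhostdb CSV dump.
# Assumed (not actual) column layout: hostname,numRecords,dnsFailures
cat > hostdb_dump.csv <<'EOF'
example.org,120,0
flaky.example.net,40,7
dead.example.com,3,12
EOF

# Blacklist hosts with more than 5 DNS failures (threshold is arbitrary).
awk -F',' '$3 > 5 { print $1 }' hostdb_dump.csv > blacklist.txt
cat blacklist.txt
```

The resulting file lists only the failing hosts (here `flaky.example.net` and `dead.example.com`), ready to feed into a URL filter.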

> HostDB for Nutch
> ----------------
>                 Key: NUTCH-1325
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: hostdb
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>         Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-removed-from-1.8.patch, NUTCH-1325-trunk-v3.patch,
NUTCH-1325-trunk-v4.patch, NUTCH-1325-trunk-v5.patch, NUTCH-1325-v4-v5.patch, NUTCH-1325.patch,
NUTCH-1325.patch, NUTCH-1325.patch, NUTCH-1325.patch, NUTCH-1325.trunk.v2.path, oi-hostdb.patch,
oi-hostdb.patch, oi-hostdb.patch
> h1. HostDB for Apache Nutch 1.x
> * automatically generates a HostDB based on CrawlDB information
> * periodically performs DNS lookup for all hosts and keeps track of DNS failures
> * discovers the homepage if it is a redirect
> * keeps track of host statistics such as the number of URLs, 404s, not-modifieds and redirects
> * aggregates CrawlDB metadata fields into configurable totals, sums, min, max and average
> * can output lists of discovered homepage URLs for seed lists and static fetch interval
> * can output blacklists for hosts that have too many DNS failures, to filter from the CrawlDB
using the domainblacklist-urlfilter
> * supports JEXL expressions, just like the CrawlDB tools
> h4. Examples
> Generate for the first time, or update an existing HostDB:
> {code}
> bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb
> {code}
> Optional filtering or normalizing:
> {code}
> bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb -filter -normalize
> {code}
> Dumping as CSV file:
> {code}
> bin/nutch readhostdb crawl/hostdb output_directory
> {code}
> Get only hostnames that have an average response time above 50ms:
> {code}
> bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(avg._rs_ > 50)"
> {code}
> Get only hosts that have over 50% 404's:
> {code}
> bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(gone / numRecords > 0.5)"
> {code}
> For JEXL expressions, all host metadata fields are available. In addition, the following
aggregate fields are available:
> unfetched -- number of unfetched records
> fetched -- number of fetched records
> gone -- number of 404's
> redirTemp -- number of temporary redirects
> redirPerm -- number of permanent redirects
> redirs -- total number of redirects (redirTemp + redirPerm)
> notModified -- number of not modified records
> ok -- number of usable pages (fetched + notModified)
> numRecords -- total number of records
> dnsFailures -- number of DNS failures
> Also, see nutch-default.xml for the hostdb.* properties.
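The derived counters above are simple sums of the base counters. A quick sanity check of that arithmetic, assuming (as the glossary suggests) that numRecords is the sum of the per-status counts; the sample numbers are made up:

```shell
# Made-up sample counters for one host; only the relationships matter.
fetched=40; notModified=10; redirTemp=3; redirPerm=2
unfetched=20; gone=5

# Derived fields as described in the glossary above.
redirs=$((redirTemp + redirPerm))   # total redirects
ok=$((fetched + notModified))       # usable pages
numRecords=$((unfetched + fetched + gone + redirTemp + redirPerm + notModified))

echo "redirs=$redirs ok=$ok numRecords=$numRecords"
```

With these inputs the script prints redirs=5, ok=50 and numRecords=80, matching the definitions redirs = redirTemp + redirPerm and ok = fetched + notModified.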

This message was sent by Atlassian JIRA
