nutch-dev mailing list archives

From Andrzej Bialecki <...@getopt.org>
Subject Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information
Date Sat, 19 Apr 2008 22:07:17 GMT
ogjunk-nutch@yahoo.com wrote:

> I do understand that CrawlDb is the source to get all known URLs
> from, and from those URLs we can extract host names, domains, etc.
> (what DomainStatistics tool does), but I don't understand how you'd
> use CrawlDb as the source of per-host data, since CrawlDb does not
> have aggregate per-host data.  Shouldn't that live in a separate
> file, a file that can be updated after every fetch run?

Well, as it happens, map-reduce is exceptionally good at collecting 
aggregate data :) This is a simple map-reduce job, where we do the 
following:

Map: 	input: <url, CrawlDatum> from CrawlDb
	output: <host, hostStats>

Host is extracted from the current URL, and hostStats is derived from 
the data in this CrawlDatum.

Reduce: input: <host, (hostStats1, hostStats2, ...)>
	output: <host, hostStats> // aggregated
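
For concreteness, here is a minimal sketch of such a job against the
classic Hadoop mapred API. To keep it self-contained I've reduced
"hostStats" to a single number (fetched pages per host); a real HostDb
entry would fold more fields from each CrawlDatum (status counts,
fetch times, scores, ...) into a dedicated Writable. The class names
below (HostStatsJob, HostMapper, HostReducer) are illustrative, not
actual Nutch code:

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

public class HostStatsJob {

  private static final LongWritable ONE = new LongWritable(1);

  // Map: <url, CrawlDatum> from CrawlDb -> <host, partial hostStats>
  public static class HostMapper extends MapReduceBase
      implements Mapper<Text, CrawlDatum, Text, LongWritable> {
    public void map(Text url, CrawlDatum datum,
        OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      String host;
      try {
        host = new URL(url.toString()).getHost();
      } catch (MalformedURLException e) {
        return; // skip URLs we cannot parse
      }
      // "hostStats" in its simplest form: emit 1 for every page of
      // this host that the CrawlDatum marks as fetched
      if (datum.getStatus() == CrawlDatum.STATUS_DB_FETCHED) {
        output.collect(new Text(host), ONE);
      }
    }
  }

  // Reduce: <host, (stats1, stats2, ...)> -> <host, aggregated stats>
  public static class HostReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text host, Iterator<LongWritable> values,
        OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      long fetched = 0;
      while (values.hasNext()) {
        fetched += values.next().get();
      }
      output.collect(host, new LongWritable(fetched));
    }
  }
}

Since this aggregation is just a sum, the same reducer could also be
set as the combiner, which keeps the shuffle small even for hosts with
millions of pages.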



PS. Could you please wrap your lines to 80 chars? I always have to
re-wrap your emails when responding, otherwise they consist of very,
very long lines ..

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

