nutch-dev mailing list archives

From: ogjunk-nu...@yahoo.com
Subject: Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information
Date: Sun, 20 Apr 2008 00:50:15 GMT
Hi,

(Andrzej - sorry about the line length - I don't see an option for that in
Y! Mail anymore; BCCing my non-Yahoo account to see what's going on.)

----- Original Message ----
> From: Andrzej Bialecki <ab@getopt.org>
> To: nutch-dev@lucene.apache.org
> Sent: Saturday, April 19, 2008 6:07:17 PM
> Subject: Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information
> 
> ogjunk-nutch@yahoo.com wrote:
> 
> > I do understand that CrawlDb is the source to get all known URLs
> > from, and from those URLs we can extract host names, domains, etc.
> > (what the DomainStatistics tool does), but I don't understand how you'd
> > use CrawlDb as the source of per-host data, since CrawlDb does not
> > have aggregate per-host data.  Shouldn't that live in a separate
> > file, a file that can be updated after every fetch run?
> 
> Well, as it happens, map-reduce is exceptionally good at collecting 
> aggregate data :) This is a simple map-reduce job, where we do the 
> following:

Right, it is, but...comments below.

> Map:     input: <url, CrawlDatum> from CrawlDb
>     output: <host, hostStats>
> 
> Host is extracted from the current URL, and hostStats is extracted from
> the data in this CrawlDatum.


Extracting the host from the URL makes sense, but there would be no
host-level data in the CrawlDatum.  For example, one of the things I'd
like to track is download speed.  I don't want to track that at the
per-URL level, but at the per-host level.  I'd keep track of the d/l
speed for each host in Fetcher2 and its FetcherInputQueue (that part is
in JIRA already).

So I'm not sure how I'd put the d/l speed for a host in the CrawlDatum...
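
For concreteness, here's the sort of per-host accumulator I'm picturing
inside Fetcher2 / its FetcherInputQueue (just a sketch - the class and
method names are made up, nothing like this is in Nutch yet):

// Hypothetical per-host stats tracker, one instance per host queue.
class HostStatsTracker {
  private long bytes;      // total bytes fetched from this host
  private long millis;     // total wall-clock time spent fetching
  private int requests;    // fetches attempted
  private int timeouts;    // fetches that timed out

  // Called once per completed fetch of a URL from this host.
  public synchronized void recordFetch(long bytesRead, long elapsedMillis) {
    bytes += bytesRead;
    millis += elapsedMillis;
    requests++;
  }

  public synchronized void recordTimeout() {
    requests++;
    timeouts++;
  }

  // Aggregate d/l speed in kbps: (bytes*8/1000) / (millis/1000) = bytes*8/millis.
  public synchronized float dlSpeed() {
    return millis == 0 ? 0 : (float) bytes * 8 / millis;
  }
}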

> Reduce: input: <host, (hostStats1, hostStats2, ...)>
>     output:  <host, hostStats> // aggregated
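
For my own clarity, here's roughly what I understand that job to look
like (a bare-bones sketch against the mapred API; HostStats and its
fromDatum()/add() helpers are made up for illustration):

import java.io.IOException;
import java.net.URL;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.nutch.crawl.CrawlDatum;

// Map: <url, CrawlDatum> from CrawlDb -> <host, hostStats>
class HostStatsMapper extends MapReduceBase
    implements Mapper<Text, CrawlDatum, Text, HostStats> {
  public void map(Text url, CrawlDatum datum,
      OutputCollector<Text, HostStats> output, Reporter reporter)
      throws IOException {
    String host = new URL(url.toString()).getHost();
    // fromDatum() would pull whatever per-URL data the CrawlDatum holds
    output.collect(new Text(host), HostStats.fromDatum(datum));
  }
}

// Reduce: <host, (hostStats1, hostStats2, ...)> -> <host, hostStats>
class HostStatsReducer extends MapReduceBase
    implements Reducer<Text, HostStats, Text, HostStats> {
  public void reduce(Text host, Iterator<HostStats> values,
      OutputCollector<Text, HostStats> output, Reporter reporter)
      throws IOException {
    HostStats sum = new HostStats();
    while (values.hasNext()) {
      sum.add(values.next());  // aggregate counts/speeds across URLs
    }
    output.collect(host, sum);
  }
}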


Let's try a concrete example.
Imagine I just ran a fetch job that fetched some number of URLs from
www.foo.com and www.bar.com.  The aggregate d/l speed for foo.com over
that fetch run was 50 kbps; for bar.com it was 20 kbps.

At the end of the run, I'd somehow store, say:
www.foo.com   dl_speed:50   requests:100   timeouts:0
www.bar.com   dl_speed:20   requests:90    timeouts:20
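
(A HostStats writable matching those records could look roughly like
this - the field names are just the ones from my example:)

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical per-host record: host -> {dl_speed, requests, timeouts}
public class HostStats implements Writable {
  float dlSpeed;   // aggregate d/l speed for the host, kbps
  int requests;    // fetches attempted in the run
  int timeouts;    // fetches that timed out

  // Fold another partial result for this host into this one.
  public void add(HostStats other) {
    // naive merge: weight the speeds by request counts
    dlSpeed = (dlSpeed * requests + other.dlSpeed * other.requests)
        / Math.max(1, requests + other.requests);
    requests += other.requests;
    timeouts += other.timeouts;
  }

  public void write(DataOutput out) throws IOException {
    out.writeFloat(dlSpeed);
    out.writeInt(requests);
    out.writeInt(timeouts);
  }

  public void readFields(DataInput in) throws IOException {
    dlSpeed = in.readFloat();
    requests = in.readInt();
    timeouts = in.readInt();
  }
}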

Then, I was thinking, something else (some HostDb MapReduce job) would
go through this data stored under segment/2008...../something/ and
merge it into the crawl/hostdb file.
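
Driver-wise, I'm imagining something like this (all paths and the
HostStatsMergeReducer are hypothetical; the reducer would fold the old
hostdb entry and the new segment stats for each host together):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class HostDbUpdater {
  // Merge the per-host stats of one segment into the existing hostdb.
  public static void update(Configuration conf, Path hostDb, Path segmentStats)
      throws IOException {
    Path newDb = new Path(hostDb, "update-" + System.currentTimeMillis());
    JobConf job = new JobConf(conf);
    job.setJobName("hostdb update " + segmentStats);
    FileInputFormat.addInputPath(job, new Path(hostDb, "current"));
    FileInputFormat.addInputPath(job, segmentStats);
    FileOutputFormat.setOutputPath(job, newDb);
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setOutputFormat(MapFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(HostStats.class);
    job.setMapperClass(IdentityMapper.class);           // keys are already hosts
    job.setReducerClass(HostStatsMergeReducer.class);   // folds old + new stats
    JobClient.runJob(job);
    // ... then swap newDb in as hostdb/current, the way CrawlDb does on update
  }
}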

It sounds like you are saying this should stick the data in the
CrawlDatum and let that be merged into crawl/crawldb... but I don't see
how I'd put the numbers from the example above into the CrawlDatum
without repeating them, such that each URL from www.foo.com would have
those 3 www.foo.com numbers stored in its crawldb entry.

> PS. Could you please wrap your lines to 80 chars? I always have to
> re-wrap your emails when responding; otherwise they consist of very,
> very long lines...


Sorry about that.  I wrapped them manually here.

Otis
