nutch-dev mailing list archives

From Andrzej Bialecki <...@getopt.org>
Subject Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information
Date Sun, 20 Apr 2008 20:56:24 GMT
ogjunk-nutch@yahoo.com wrote:


> Host extraction from URL makes sense, but there would be no host-level
> data in CrawlDatum.  For example, one of the things I'd like to track is
> download speed.  I don't want to track that on the per-URL level, but on
> a per-host level.  I'd keep track of the d/l speed for each host in Fetcher2
> and its FetcherInputQueue (that part is in JIRA already). 
> 
> So I'm not sure how I'd put the d/l speed for a host in the CrawlDatum....

You really don't have to - see below. The queue monitoring stuff in 
Fetcher gives you only the metrics for the current fetchlist anyway, so 
they are incomplete - to get the actual averages you need all URLs from 
that host, not just the ones in the current fetchlist. That's why it's 
better to do this using the information from CrawlDb rather than just 
the current segment.

So, let's assume for a moment that you don't track the d/l speed per 
host in the Fetcher (or that you discard this information), and that 
you only add the actually measured per-URL download speed to 
crawl_fetch, as part of CrawlDatum.metaData. This metadata will be 
merged into the CrawlDb during the updatedb operation (replacing any 
older values if they exist).
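
For example, the fetcher-side code could look something like this (just 
a minimal sketch - the "_dl_speed_" key name and the helper class are 
made up, and it assumes CrawlDatum's MapWritable-based metaData):

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class DlSpeedMeta {
  // Hypothetical metadata key; the HostDb job would look it up later.
  public static final Text DL_SPEED_KEY = new Text("_dl_speed_");

  /** Attach the measured per-URL download speed (kbps) to the datum. */
  public static void recordSpeed(CrawlDatum datum, float kbps) {
    MapWritable meta = datum.getMetaData();
    if (meta == null) {            // some versions don't lazily init
      meta = new MapWritable();
      datum.setMetaData(meta);
    }
    meta.put(DL_SPEED_KEY, new FloatWritable(kbps));
  }
}

FloatWritable keeps the value compact; a Text value would work too if 
you prefer it human-readable in readdb dumps.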


> 
>> Reduce: input: <host, (hostStats1, hostStats2, ...)>
>>     output:  <host, hostStats> // aggregated
> 
> 
> Let's try with a concrete example.
> Imagine I just ran a fetch job and that fetched some number of URLs
> from www.foo.com and www.bar.com. foo.com aggregate d/l speed for
> that fetch run was 50 kbps.  bar.com speed was 20 kbps.
> 
> At the end of the run, I'd somehow store, say:
> www.foo.com dl_speed:50 requests:100 timeouts:0
> www.bar.com dl_speed:20 requests:90 timeouts:20

No, what you want to store is this:

www.example.com/page1.html dl_speed:50 status:ok
www.example.com/page2.html dl_speed:45 status:ok
www.example.com/page3.html dl_speed:0 status:gone
...

> 
> Then, I was thinking, something else (some HostDb MapReduce job)
> would go through this data stored under segment/2008...../something/
> and merge it into crawl/hostdb file.
> 
> It sounds like you are saying, this should stick the data in CrawlDatum
> and let that be merged into crawl/crawldb.... but I don't see how I'd put
> the numbers from the above example into CrawlDatum without
> repeating them, so that each URL from www.foo.com has those 3
> numbers above for www.foo.com stored in their crawldb entries.

See above - we store only per-URL metrics in CrawlDb. Then the HostDb 
job aggregates the info from CrawlDb using the host name (or domain 
name, or TLD) as the key.
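
A rough sketch of such a job (using the old mapred API; the 
"_dl_speed_" key is the hypothetical one from above - a real HostDb 
would of course fold more than speed, e.g. request and timeout counts, 
into some HostDatum-like Writable):

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Iterator;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

public class HostDbSketch {
  private static final Text DL_SPEED_KEY = new Text("_dl_speed_");

  /** Re-keys each CrawlDb entry by host (could equally be domain/TLD). */
  public static class HostMapper extends MapReduceBase
      implements Mapper<Text, CrawlDatum, Text, FloatWritable> {
    public void map(Text url, CrawlDatum datum,
        OutputCollector<Text, FloatWritable> out, Reporter reporter)
        throws IOException {
      String host;
      try {
        host = new URL(url.toString()).getHost();
      } catch (MalformedURLException e) {
        return;                      // skip unparseable URLs
      }
      MapWritable meta = datum.getMetaData();
      if (meta == null) return;      // no fetch metrics for this URL
      Writable speed = meta.get(DL_SPEED_KEY);
      if (speed instanceof FloatWritable) {
        out.collect(new Text(host), (FloatWritable) speed);
      }
    }
  }

  /** Averages the per-URL speeds into one host-level figure. */
  public static class HostReducer extends MapReduceBase
      implements Reducer<Text, FloatWritable, Text, FloatWritable> {
    public void reduce(Text host, Iterator<FloatWritable> speeds,
        OutputCollector<Text, FloatWritable> out, Reporter reporter)
        throws IOException {
      float sum = 0;
      int n = 0;
      while (speeds.hasNext()) {
        sum += speeds.next().get();
        n++;
      }
      if (n > 0) {
        out.collect(host, new FloatWritable(sum / n));
      }
    }
  }
}

Counts like the requests/timeouts from your example would be summed in 
the same reducer instead of averaged.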

>> PS. Could you please wrap your lines to 80 chars? I always have to
>> re-wrap your emails when responding, otherwise they consist of very,
>> very long lines ..
> 
> 
> Sorry about that.  I wrapped them manually here.

Thanks. Mail apps are no longer what they used to be ...

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

