nutch-dev mailing list archives

From: ogjunk-nu...@yahoo.com
Subject: Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information
Date: Tue, 22 Apr 2008 05:46:44 GMT
Thanks Andrzej.  So the only disconnect was where the download speed gets
measured: per-URL (your suggestion) vs. per-host (what I had in mind).

In that case, I think we are talking about a small change (to Fetcher2) that might
look like this:

+              // time the request
+              long fetchStart = System.currentTimeMillis();
               ProtocolOutput output = protocol.getProtocolOutput(fit.url, fit.datum);
+              long fetchTime = (System.currentTimeMillis() - fetchStart) / 1000;
               ProtocolStatus status = output.getStatus();
               Content content = output.getContent();
               ParseStatus pstatus = null;
+
+              // compute page download speed in kilobits/sec; guard against
+              // sub-second fetches and fetches that returned no content
+              if (content != null && fetchTime > 0) {
+                int kbps = Math.round((((float) content.getContent().length) * 8 / 1024) / fetchTime);
+                LOG.info("Fetch time: " + fetchTime + " KBPS: " + kbps + " URL: " + fit.url);
+//                fit.datum.getMetaData().put(new Text("kbps"), new IntWritable(kbps));
+              }

I *think* the updatedb step will keep any new keys/values in that MetaData
MapWritable in the CrawlDatum while merging, right?
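
For reference, the full metadata round trip I have in mind looks roughly
like this (the "kbps" key name is just my working choice, not something
Nutch defines):

// writing, in Fetcher2 (the commented-out line above):
fit.datum.getMetaData().put(new Text("kbps"), new IntWritable(kbps));

// reading it back later, e.g. in a HostDb job:
IntWritable kbps = (IntWritable) datum.getMetaData().get(new Text("kbps"));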


Then, HostDb would run through CrawlDb and aggregate (easy); a rough
sketch of such a job follows below.  But:
- What other data should be stored in CrawlDatum?
- How exactly should that data be aggregated? (mostly summed or averaged?)
- How exactly will this data then be accessed? (I need to be able to do
  host-based lookups)
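
To make the aggregation half concrete, here is a rough sketch of what I
imagine the HostDb job's map and reduce could look like.  The class names
and the "kbps" metadata key are mine, purely for illustration, and
averaging is just one possible aggregation:

import java.io.IOException;
import java.net.URL;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

public class HostDb {

  // Map: <url, CrawlDatum> -> <host, kbps>
  public static class HostDbMapper extends MapReduceBase
      implements Mapper<Text, CrawlDatum, Text, IntWritable> {
    public void map(Text url, CrawlDatum datum,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      Writable kbps = datum.getMetaData().get(new Text("kbps"));
      if (kbps == null) return;              // this URL was never timed
      String host = new URL(url.toString()).getHost();
      output.collect(new Text(host), (IntWritable) kbps);
    }
  }

  // Reduce: <host, list of kbps> -> <host, average kbps>  // aggregated
  public static class HostDbReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text host, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      long sum = 0;
      int count = 0;
      while (values.hasNext()) {
        sum += values.next().get();
        count++;
      }
      output.collect(host, new IntWritable((int) (sum / count)));
    }
  }
}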


My immediate interest is computing per-host download speed, so that
Generator can take it into consideration when creating fetchlists.
I'm not 100% sure that will actually improve the overall fetch speed,
but I imagine I will have to load this HostDb data into a Map, so I can
change Generator and add something like this in one of its map or
reduce methods:

Integer kbps = hostdb.get(host);          // hostdb: Map<String, Integer>
if (kbps != null && kbps.intValue() < N)  // N = configurable minimum speed
  return;                                 // don't emit this host's entries
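
The loading itself could happen in Generator's configure().  A minimal
sketch, assuming HostDb ends up as a flat file of tab-separated
"host<TAB>kbps" lines (the "generator.hostdb" property name is
hypothetical):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

private Map<String, Integer> hostdb = new HashMap<String, Integer>();

public void configure(JobConf job) {
  try {
    Path path = new Path(job.get("generator.hostdb"));  // hypothetical key
    FileSystem fs = path.getFileSystem(job);
    BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(path)));
    String line;
    while ((line = in.readLine()) != null) {
      String[] fields = line.split("\t");               // host \t kbps
      hostdb.put(fields[0], Integer.valueOf(fields[1]));
    }
    in.close();
  } catch (IOException e) {
    LOG.warn("Could not load hostdb: " + e.getMessage());
  }
}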

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Andrzej Bialecki <ab@getopt.org>
> To: nutch-dev@lucene.apache.org
> Sent: Sunday, April 20, 2008 4:56:24 PM
> Subject: Re: [jira] Commented: (NUTCH-628) Host database to keep track of
> host-level information
> 
> ogjunk-nutch@yahoo.com wrote:
> 
> 
> > Host extraction from URL makes sense, but there would be no host-level
> > data in CrawlDatum.  For example, one of the things I'd like to track is
> > download speed.  I don't want to track that on the per-URL level, but on
> > a per-host level.  I'd keep track of the d/l speed for each host in Fetcher2
> > and its FetcherInputQueue (that part is in JIRA already). 
> > 
> > So I'm not sure how I'd put the d/l speed for a host in the CrawlDatum....
> 
> You really don't have to - see below. The queue monitoring stuff in 
> Fetcher gives you only the current fetchlist metrics anyway, so they are 
> incomplete - you need to calculate the actual averages from all urls 
> from that host, and not just the current fetchlist. That's why it's 
> better to do this using the information from CrawlDb and not just from 
> the current segment.
> 
> So, let's assume for a moment that you don't track the d/l speed per 
> host in Fetchers, or you discard this information, and assume that you 
> only add the actually measured per-url download speed to crawl_fetch, as 
> part of CrawlDatum.metaData. This metadata will be merged to the CrawlDb 
> during the updatedb operation (replacing any older values if they exist).
> 
> 
> > 
> >> Reduce: input: <host, list of per-url values>
> >>     output: <host, aggregated values>  // aggregated
> > 
> > 
> > Let's try with a concrete example.
> > Imagine I just ran a fetch job that fetched some number of URLs
> > from www.foo.com and www.bar.com.  foo.com's aggregate d/l speed for
> > that fetch run was 50 kbps, and bar.com's was 20 kbps.
> > 
> > At the end of the run, I'd somehow store, say:
> > www.foo.com  dl_speed:50  requests:100  timeouts:0
> > www.bar.com  dl_speed:20  requests:90   timeouts:20
> 
> No, what you want to store is this:
> 
> www.example.com/page1.html dl_speed:50 status:ok
> www.example.com/page2.html dl_speed:45 status:ok
> www.example.com/page3.html dl_speed:0 status:gone
> ...
> 
> > 
> > Then, I was thinking, something else (some HostDb MapReduce job)
> > would go through this data stored under segment/2008...../something/
> > and merge it into crawl/hostdb file.
> > 
> > It sounds like you are saying this should stick the data in CrawlDatum
> > and let that be merged into crawl/crawldb... but I don't see how I'd put
> > the numbers from the above example into CrawlDatum without repeating
> > them, i.e. without each URL from www.foo.com having those 3 numbers
> > stored in its crawldb entry.
> 
> See above - we store only per-url metrics in CrawlDb. Then the HostDb 
> job aggregates the info from CrawlDb using host name (or domain name, or 
> TLD) as the key.
> 
> >> PS. Could you please wrap your lines to 80 chars? I always have to
> >> re-wrap your emails when responding, otherwise they consist of very,
> >> very long lines ..
> > 
> > 
> > Sorry about that.  I wrapped them manually here.
> 
> Thanks. Mail apps are no longer what they used to be ...
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

