nutch-dev mailing list archives

From Andrzej Bialecki <...@getopt.org>
Subject Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information
Date Tue, 22 Apr 2008 08:48:22 GMT
ogjunk-nutch@yahoo.com wrote:

> +              // time the request
> +              long fetchStart = System.currentTimeMillis();
>                ProtocolOutput output = protocol.getProtocolOutput(fit.url, fit.datum);
> +              long fetchTime = (System.currentTimeMillis() - fetchStart)/1000;
>                ProtocolStatus status = output.getStatus();
>                Content content = output.getContent();
>                ParseStatus pstatus = null;
> +
> +              // compute page download speed
> +              int kbps = Math.round(((((float)content.getContent().length)*8)/1024)/fetchTime);
> +              LOG.info("Fetch time: " + fetchTime + " KBPS: " + kbps + " URL: " + fit.url);
> +//              fit.datum.getMetaData().put(new Text("kbps"), new IntWritable(kbps));


Yes, that's more or less correct.
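
One nit: if the fetch takes less than a second, the integer division
by 1000 gives fetchTime == 0 and the kbps value blows up, and content
may be null when the fetch fails. Something along these lines would be
safer (untested sketch, same variables as in your patch):

              // time the request
              long fetchStart = System.currentTimeMillis();
              ProtocolOutput output = protocol.getProtocolOutput(fit.url, fit.datum);
              // keep milliseconds and clamp to >= 1 to avoid division by zero
              long fetchTimeMs = Math.max(1, System.currentTimeMillis() - fetchStart);
              ProtocolStatus status = output.getStatus();
              Content content = output.getContent();

              // compute page download speed, skipping failed fetches
              if (content != null && content.getContent() != null) {
                float kbits = (content.getContent().length * 8.0f) / 1024.0f;
                int kbps = Math.round(kbits / (fetchTimeMs / 1000.0f));
                LOG.info("Fetch time: " + fetchTimeMs + " ms, KBPS: " + kbps
                    + " URL: " + fit.url);
                fit.datum.getMetaData().put(new Text("kbps"), new IntWritable(kbps));
              }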

> 
> I *think* the updatedb step will keep any new keys/values in that MetaData
>  MapWritable in the CrawlDatum while merging, right?

Right.
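
So on the reading side (e.g. in the Generator or in a HostDb job),
getting the value back out of the metadata would look roughly like
this - the "kbps" key name is just the one from your patch:

  // datum is the CrawlDatum for the current url
  MapWritable meta = datum.getMetaData();
  IntWritable kbps = (IntWritable) meta.get(new Text("kbps"));
  if (kbps != null) {
    // the value survived the updatedb merge
    int pageSpeed = kbps.get();
  }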


> Then, HostDb would run through CrawlDb and aggregate (easy).
> But:
> What other data should be stored in CrawlDatum?

I can think of a few other useful metrics:

* the fetch status (failure / success) - this will be aggregated into a 
failure rate per host.

* number of outlinks - this is useful for determining areas within a 
site with a high density of links

* content type

* size

Other aggregated metrics, which are derived from urls alone, could 
include the following:

* number of urls with query strings (may indicate spider traps or a 
database)

* total number of urls from a host (obvious :) ) - useful to limit the 
max. number of urls per host.

* max depth per host - again, to enforce limits on the max. depth of the 
crawl, where depth is defined as the maximum number of elements in the 
URL path.

* spam-related metrics (# of pages with known spam hrefs, keyword 
stuffing in meta-tags, # of pages with spam keywords, etc, etc).

Plus a range of arbitrary operator-specific tags / metrics, usually 
manually added:

* special fetching parameters (maybe authentication, or the overrides 
for crawl-delay or the number of threads per host)

* other parameters affecting the crawling priority or page ranking for 
all pages from that host

As you can see, the possibilities are nearly endless, and they all 
revolve around the issues of crawl quality and performance.
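
Just to make the shape of such a record concrete, a HostDb value could
be a Writable along these lines (a purely hypothetical HostDatum - the
field names are invented here, following the metrics listed above):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.MapWritable;
  import org.apache.hadoop.io.Writable;

  public class HostDatum implements Writable {
    private long urlCount;        // total number of urls from this host
    private long fetchFailures;   // for the per-host failure rate
    private long fetchSuccesses;
    private float avgKbps;        // running average download speed
    private long outlinkCount;
    private int maxDepth;         // max number of elements in the URL path
    private long queryUrlCount;   // urls with query strings
    private MapWritable operatorMeta = new MapWritable(); // manual tags / overrides

    public void write(DataOutput out) throws IOException {
      out.writeLong(urlCount);
      out.writeLong(fetchFailures);
      out.writeLong(fetchSuccesses);
      out.writeFloat(avgKbps);
      out.writeLong(outlinkCount);
      out.writeInt(maxDepth);
      out.writeLong(queryUrlCount);
      operatorMeta.write(out);
    }

    public void readFields(DataInput in) throws IOException {
      urlCount = in.readLong();
      fetchFailures = in.readLong();
      fetchSuccesses = in.readLong();
      avgKbps = in.readFloat();
      outlinkCount = in.readLong();
      maxDepth = in.readInt();
      queryUrlCount = in.readLong();
      operatorMeta.readFields(in);
    }
  }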


> How exactly should that data be aggregated? (mostly added up?)

Good question. Some of these are simple sums, others are running 
averages, and still others may form a historical log of the most 
recent values. We could specify a couple of standard operations, 
such as:

* SUM - initialize to zero, and add all values

* AVG - calculate arithmetic average from all values

* MAX / MIN - retain only the largest / smallest value

* LOG - keep a log of the last N values - a somewhat orthogonal concept 
to the above, i.e. it could be a valid option for any of the above operations.

This complicates the simplistic model of HostDb that we had :) and 
indicates that we may need a sort of schema descriptor for it.
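
A rough sketch of how the reduce side could apply such a declared
operation per metric (the AggOp enum and the helper method are invented
just to illustrate the idea; LOG would instead keep a bounded list of
recent values):

  enum AggOp { SUM, AVG, MAX, MIN }

  static float aggregate(AggOp op, Iterable<Float> values) {
    float acc = 0f;
    int n = 0;
    for (float v : values) {
      switch (op) {
        case SUM:
        case AVG: acc += v; break;
        case MAX: acc = (n == 0) ? v : Math.max(acc, v); break;
        case MIN: acc = (n == 0) ? v : Math.min(acc, v); break;
      }
      n++;
    }
    return (op == AggOp.AVG && n > 0) ? acc / n : acc;
  }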

> How exactly will this data then be accessed? (I need to be able to do host-based lookup)

Ah, this is an interesting issue :) All tools in Nutch use URL-based 
keys, which means they operate at the per-page level. Now we need to 
join this with HostDb, which uses host names as keys. If you want to 
use HostDb as one of the inputs to a map-reduce job, I described this 
problem here, and Owen O'Malley provided a solution:

https://issues.apache.org/jira/browse/HADOOP-2853

This would require significant changes in several Nutch tools, i.e. 
several m-r jobs would have to be restructured.

There is, however, a different approach, which may be efficient 
enough: put the HostDb in a DistributedCache and read it directly as 
a MapFile (or BloomMapFile - see HADOOP-3063). I tried this, and for 
medium-sized datasets the performance was acceptable.
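
In code the second approach boils down to something like this (only a
sketch: a MapFile is really a directory of data + index, so it would be
shipped as a DistributedCache archive; the paths and the HostDatum
value class are placeholders):

  private MapFile.Reader hostdb;

  public void configure(JobConf conf) {
    try {
      // the HostDb MapFile dir, unpacked locally by the DistributedCache
      Path[] cached = DistributedCache.getLocalCacheArchives(conf);
      hostdb = new MapFile.Reader(FileSystem.getLocal(conf),
          cached[0].toString(), conf);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  // then, per record in map() or reduce():
  //   Text host = new Text(new java.net.URL(url.toString()).getHost());
  //   HostDatum hd = new HostDatum();       // hypothetical value class
  //   if (hostdb.get(host, hd) != null) { ... use per-host aggregates ... }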


> My immediate interest is computing per-host download speed, so 
> that Generator can take that into consideration when creating fetchlists.
> I'm not even 100% sure if that will even have a positive effect on the
> overall fetch speed, but I imagine I will have to load this HostDb data
> in a Map, so I can change Generator and add something like this
> in one of its map or reduce methods:
> 
> int kbps = hostdb.get(host);
> if (kbps < N) { don't emit this host }

Right, that's how the API could look using the second approach I 
outlined above. You could even wrap it in a URLFilter plugin, so that 
you don't have to modify the Generator.
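
A skeleton of such a filter could look like this - the HostDb lookup
wrapper and the threshold property name are of course hypothetical:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  public class SlowHostURLFilter implements URLFilter {
    private Configuration conf;
    private HostDbLookup hostdb;  // hypothetical wrapper around the MapFile reader
    private int minKbps;

    public String filter(String urlString) {
      try {
        String host = new java.net.URL(urlString).getHost();
        Integer kbps = hostdb.getKbps(host);     // hypothetical accessor
        if (kbps != null && kbps.intValue() < minKbps) {
          return null;                           // reject urls from slow hosts
        }
      } catch (Exception e) {
        // on any error, let the url pass rather than lose it
      }
      return urlString;                          // accept
    }

    public void setConf(Configuration conf) {
      this.conf = conf;
      this.minKbps = conf.getInt("generate.min.host.kbps", 0); // invented property
      // this.hostdb = ... open the HostDb here, e.g. as in the MapFile sketch
    }

    public Configuration getConf() {
      return conf;
    }
  }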

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

