nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-289) CrawlDatum should store IP address
Date Wed, 31 May 2006 07:39:30 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413996 ] 

Andrzej Bialecki  commented on NUTCH-289:
-----------------------------------------

Re: lookup in ParseOutputFormat: I respectfully disagree. Consider the scenario where you run
Fetcher in non-parsing mode. You then have to make two DNS lookups: one when
fetching, and a second one when parsing. These lookups are executed from different
processes, so caching inside the Java resolver gives no benefit, i.e. the DNS server
has to be queried twice. The solution I proposed (record IPs in Fetcher, but somewhere
other than ParseOutputFormat, e.g. in the crawl_fetch CrawlDatum) avoids this problem.
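
To illustrate the idea of resolving once at fetch time and reusing the result later, here is a minimal sketch. It is not Nutch code: FetchIpRecorder, FetchRecord, and ipForParsing are hypothetical names, and the resolver is a plain function so that the example runs without real DNS.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: none of these names are Nutch classes.
class FetchIpRecorder {
    /** Stand-in for the per-URL record a fetcher would write to crawl_fetch. */
    static final class FetchRecord {
        final String url;
        final String ip; // resolved once, at fetch time
        FetchRecord(String url, String ip) { this.url = url; this.ip = ip; }
    }

    private final Function<String, String> resolver; // host -> IP; a real one would query DNS
    private final Map<String, String> cache = new HashMap<>();
    int lookups = 0; // number of resolver calls, kept only for illustration

    FetchIpRecorder(Function<String, String> resolver) { this.resolver = resolver; }

    /** Fetch step: resolve the host (at most once per host in this process) and record the IP. */
    FetchRecord fetch(String url) {
        String host = URI.create(url).getHost();
        String ip = cache.computeIfAbsent(host, h -> { lookups++; return resolver.apply(h); });
        return new FetchRecord(url, ip);
    }

    /** Parse step, possibly in another process: reads the stored IP, makes no DNS call. */
    static String ipForParsing(FetchRecord record) { return record.ip; }

    public static void main(String[] args) {
        // Fake resolver returning a documentation-range address (assumption for the demo).
        FetchIpRecorder fetcher = new FetchIpRecorder(host -> "192.0.2.10");
        FetchRecord r = fetcher.fetch("http://example.com/page");
        System.out.println(r.url + " -> " + ipForParsing(r));
    }
}
```

The point of the sketch is only that the parse step takes the address from the stored record, so it does not matter that it runs in a different process from the fetch step.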

Another issue is virtual hosting, i.e. many sites resolving to a single IP (web hotels). It's
true that in many cases these are spam sites, but as often as not they are real, legitimate sites.
If we generate/fetch by IP address, we run the risk of dropping legitimate sites.
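
The risk can be made concrete with a small sketch. It assumes a simple "keep at most N URLs per key" rule, where the key is either the hostname or the IP address. PerKeyTruncation, truncate, and hostOf are hypothetical names, and the host-to-IP mapping is invented for the example.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of per-key truncation; not the Nutch Generator.
class PerKeyTruncation {
    /** Keeps at most maxPerKey URLs for each key (host or IP), in input order. */
    static List<String> truncate(List<String> urls, Function<String, String> keyOf, int maxPerKey) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            int seen = counts.merge(keyOf.apply(url), 1, Integer::sum);
            if (seen <= maxPerKey) kept.add(url);
        }
        return kept;
    }

    static String hostOf(String url) { return URI.create(url).getHost(); }

    public static void main(String[] args) {
        // Three distinct sites on one shared IP (invented mapping, documentation address range).
        Map<String, String> hostToIp = new HashMap<>();
        hostToIp.put("a.example", "192.0.2.1");
        hostToIp.put("b.example", "192.0.2.1");
        hostToIp.put("c.example", "192.0.2.1");
        List<String> urls = Arrays.asList("http://a.example/", "http://b.example/", "http://c.example/");

        // Per host: every site keeps its URL. Per IP: the third site is cut off.
        System.out.println("by host: " + truncate(urls, PerKeyTruncation::hostOf, 2));
        System.out.println("by IP:   " + truncate(urls, u -> hostToIp.get(hostOf(u)), 2));
    }
}
```

With a limit of two per key, truncating by hostname keeps all three sites, while truncating by IP drops the third one even though it is a separate site.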

Regarding the timing: it's true that during the first run we won't have IPs during generate
(nor, subsequently, for any newly injected URLs). In fact, since a significant part of the
CrawlDB is usually unfetched, we won't have this information for many URLs, unless we add a
step in Generator that resolves ALL hosts, and then run an equivalent of updatedb to actually
record them in the CrawlDB.

And the last issue that needs to be discussed: should we use metadata, or add a dedicated
field in CrawlDatum? If the core is to rely on IP addresses, we should add a dedicated
field. If the IP is purely optional (e.g. for use by optional plugins), then metadata
seems a better place.
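
The two options can be compared side by side in a simplified sketch. These classes are hypothetical stand-ins and do not reproduce the real CrawlDatum API; the metadata key "_ip_" is likewise an invented name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; neither class is the real CrawlDatum.
class IpStorageSketch {
    /** Option A: a dedicated field, always present in the record layout. */
    static final class DatumWithIpField {
        private byte[] ip; // raw address bytes; null until resolved

        void setIp(byte[] ip) { this.ip = (ip == null) ? null : ip.clone(); }
        byte[] getIp() { return (ip == null) ? null : ip.clone(); }
    }

    /** Option B: an optional metadata entry, present only when something sets it. */
    static final class DatumWithMetadata {
        static final String IP_KEY = "_ip_"; // invented key name
        private final Map<String, String> metaData = new HashMap<>();

        Map<String, String> getMetaData() { return metaData; }
    }

    public static void main(String[] args) {
        DatumWithIpField a = new DatumWithIpField();
        a.setIp(new byte[] {(byte) 192, 0, 2, 1});

        DatumWithMetadata b = new DatumWithMetadata();
        b.getMetaData().put(DatumWithMetadata.IP_KEY, "192.0.2.1");

        System.out.println("field length: " + a.getIp().length);
        System.out.println("metadata:     " + b.getMetaData().get(DatumWithMetadata.IP_KEY));
    }
}
```

In the first variant core code can rely on a typed accessor. In the second, only code that knows the key pays any attention to the value.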

> CrawlDatum should store IP address
> ----------------------------------
>
>          Key: NUTCH-289
>          URL: http://issues.apache.org/jira/browse/NUTCH-289
>      Project: Nutch
>         Type: Bug

>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Doug Cutting

>
> If the CrawlDatum stored the IP address of the host of its URL, then one could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname.  This would be a
>   good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a new outlink,
> or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

