nutch-dev mailing list archives

From "Stefan Groschupf (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-289) CrawlDatum should store IP address
Date Thu, 01 Jun 2006 18:41:30 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12414273 ] 

Stefan Groschupf commented on NUTCH-289:
----------------------------------------

Andrzej, I'm afraid I did not communicate my ideas clearly, and we may be misunderstanding
each other.
Resolving the IP in ParseOutputFormat would only be necessary for the new links discovered in
the content.
Since by default we parse during fetching, we would have the chance to use the JVM DNS cache,
since I guess many new URLs point to the same host that we fetched the page from.
That means we get the best JVM cache usage if we do not parse separately.
We do not look up the IPs of the URLs we fetch at this point, since these URLs already have an IP
that was resolved when they were first discovered in a parse process.
The only problem we need to handle is what happens when the IP of a host changes. We can simply
look up the IP of a URL that throws a protocol error and compare the cached IP with the freshly
resolved one.
An alternative approach would be to look up IPs during the crawldb update, just for the new URLs.
Sorry, I hope that describes my ideas more clearly.
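
A minimal sketch of the flow described above, to make the two lookup points concrete. It is not Nutch code: the class IpTracker, its methods, and the per-host map that stands in for an IP field on the CrawlDatum are all hypothetical names chosen for illustration. The resolver is passed in so the logic can be tried without network access; jvmResolve shows the InetAddress call that would go through the JVM's DNS cache.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch; none of these names come from Nutch. */
class IpTracker {
    /** Stand-in for the IP a CrawlDatum would carry, keyed by host for simplicity. */
    private final Map<String, String> storedIp = new HashMap<>();
    /** Injected resolver: host name to IP string, or null if the lookup fails. */
    private final Function<String, String> resolver;

    IpTracker(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    /** Resolver backed by InetAddress, which consults the JVM's DNS cache. */
    static String jvmResolve(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    /** New outlink found while parsing: resolve only if no IP is stored yet. */
    String onNewOutlink(String host) {
        // computeIfAbsent stores nothing when the resolver returns null.
        return storedIp.computeIfAbsent(host, resolver);
    }

    /**
     * Fetch failed with a protocol error: look the host up again and compare
     * with the stored IP. Returns true if the IP changed and was updated.
     */
    boolean onProtocolError(String host) {
        String cached = storedIp.get(host);
        String fresh = resolver.apply(host);
        if (fresh == null || fresh.equals(cached)) {
            return false; // lookup failed or IP unchanged
        }
        storedIp.put(host, fresh);
        return true;
    }

    /** Currently stored IP for the host, or null if none. */
    String storedIp(String host) {
        return storedIp.get(host);
    }
}
```

With `new IpTracker(IpTracker::jvmResolve)` the lookups would hit real DNS. Note that URLs already known to the tracker are never resolved again unless a protocol error occurs, which is the point of the proposal.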

My personal view is that the IP should be stored in the CrawlDatum itself, not in its metadata.






> CrawlDatum should store IP address
> ----------------------------------
>
>          Key: NUTCH-289
>          URL: http://issues.apache.org/jira/browse/NUTCH-289
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Doug Cutting
>
> If the CrawlDatum stored the IP address of the host of its URL, then one could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname.  This would be a
> good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a new outlink,
> or perhaps during CrawlDB update.
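
A rough sketch of the two uses listed in the quoted description, assuming each URL already has a resolved IP. IpPartitioner, partitionFor, and partition are hypothetical names, not part of Nutch, and the hash-modulo scheme is only one possible way to assign IPs to fetch lists.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; not the Nutch generator or partitioner. */
class IpPartitioner {
    /** Map an IP string to one of numPartitions fetch lists. */
    static int partitionFor(String ip, int numPartitions) {
        // floorMod keeps the result non-negative for negative hash codes.
        return Math.floorMod(ip.hashCode(), numPartitions);
    }

    /**
     * Group URLs into fetch lists by IP, keeping at most maxPerIp URLs per IP.
     * urlToIp maps each URL to its already-resolved IP address.
     */
    static Map<Integer, List<String>> partition(
            Map<String, String> urlToIp, int numPartitions, int maxPerIp) {
        Map<String, Integer> seenPerIp = new HashMap<>();
        Map<Integer, List<String>> fetchLists = new TreeMap<>();
        for (Map.Entry<String, String> e : urlToIp.entrySet()) {
            String ip = e.getValue();
            int seen = seenPerIp.merge(ip, 1, Integer::sum);
            if (seen > maxPerIp) {
                continue; // truncate: this IP already has its quota
            }
            fetchLists
                .computeIfAbsent(partitionFor(ip, numPartitions), k -> new ArrayList<>())
                .add(e.getKey());
        }
        return fetchLists;
    }
}
```

Because the key is the IP rather than the hostname, many hostnames served from one address land in the same fetch list and share one quota, which is what limits the effect of domain spammers.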

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

