nutch-dev mailing list archives

From "Walter Tietze (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1360) Support the storing of IP address connected to when web crawling
Date Thu, 28 Nov 2013 19:57:35 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835056#comment-13835056
] 

Walter Tietze commented on NUTCH-1360:
--------------------------------------

Hi Lewis,

As mentioned in the mail I sent you, I am attaching my patch for storing IP addresses in
apache-nutch-1.5.1.

( https://issues.apache.org/jira/browse/NUTCH-289 might also be appropriate!)

In our project MIA (http://mia-marktplatz.de/) we crawl the German web. To stay polite we
had to switch to a 'byIP' policy, which guarantees an interval of at least one minute between
requests to the same server. Crawling 'byHost' was not an option, because many sites use up to
several thousand subdomains hosted on a single server with one IP address.
As the crawl proceeded I realized that crawling by IP slowed down, because while generating the
URL lists Nutch has to resolve the IP address of every URL in order to build the per-IP fetch
queues.
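
For readers not familiar with the 'byIP' mode, here is a minimal sketch (my own illustration,
not the actual Nutch partitioner code) of how a per-IP queue key differs from a per-host key;
the point is that every distinct host requires a blocking DNS lookup:

import java.net.InetAddress;
import java.net.URL;
import java.net.UnknownHostException;

public class QueueKeyExample {

    // "byHost" just lowercases the hostname; "byIP" needs a blocking
    // DNS lookup for every distinct host, which is what slows down the
    // generate step when no cache is in place.
    static String queueKey(URL url, String mode) throws UnknownHostException {
        if ("byIP".equalsIgnoreCase(mode)) {
            return InetAddress.getByName(url.getHost()).getHostAddress();
        }
        return url.getHost().toLowerCase();
    }

    public static void main(String[] args) throws Exception {
        URL u = new URL("http://www.example.org/index.html");
        System.out.println(queueKey(u, "byHost"));
        System.out.println(queueKey(u, "byIP"));
    }
}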

My solution is a simple one: it writes the IP address, once determined, into the metadata field
of the CrawlDatum object. When a crawl cycle has finished its fetch job, an additional
MapReduce job is started to determine the IP addresses of the newly fetched and parsed URLs.
New URLs are inserted into the CrawlDb together with their IP address whenever one could be
determined.
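
To illustrate the idea only (this is a sketch under my own assumptions, not the code from the
patch; in particular the metadata key name "_ip_" is hypothetical), storing and reading the
resolved address in the CrawlDatum metadata looks roughly like this:

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class IpMetadataExample {

    // Hypothetical key name; the patch may use a different constant.
    public static final Text IP_KEY = new Text("_ip_");

    // Store the resolved address once, so later generate steps can read
    // it from the CrawlDb instead of asking the DNS again.
    public static void setIp(CrawlDatum datum, String ip) {
        datum.getMetaData().put(IP_KEY, new Text(ip));
    }

    public static String getIp(CrawlDatum datum) {
        Text ip = (Text) datum.getMetaData().get(IP_KEY);
        return ip == null ? null : ip.toString();
    }
}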

The patch also contains the two classes IpAddressResolver.java and DNSCache.java, which cache
IP addresses already obtained from the DNS and control the number of concurrent DNS calls made
by each map task. Since many URLs with the same IP address are generated into the same queue,
I wanted to minimize the work needed to build up the queues, and caching the IP addresses
in memory shouldn't consume much memory. To avoid too many concurrent DNS requests from the
crawler, I added some code to restrict the number of parallel requests to the DNS.
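
The following is only an illustration of the caching and throttling idea (class and method
names are mine; the real IpAddressResolver/DNSCache differ in configuration and error handling):

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

public class SimpleDnsCache {

    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<String, String>();
    private final Semaphore permits;

    public SimpleDnsCache(int maxConcurrentLookups) {
        this.permits = new Semaphore(maxConcurrentLookups);
    }

    // Return the cached address if present, otherwise resolve it while
    // holding a permit so that at most maxConcurrentLookups DNS queries
    // run in parallel within one task.
    public String resolve(String host) throws UnknownHostException, InterruptedException {
        String cached = cache.get(host);
        if (cached != null) {
            return cached;
        }
        permits.acquire();
        try {
            String ip = InetAddress.getByName(host).getHostAddress();
            cache.putIfAbsent(host, ip);
            return ip;
        } finally {
            permits.release();
        }
    }
}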

I have been using this code in production for about three quarters of this year and it seems to
work fine. The four configuration entries should be self-explanatory.
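
Just as an illustration of the pattern (the property names below are placeholders I made up,
not the names used in the patch; check nutch-default.xml after applying it for the real ones),
such entries are read from the Hadoop Configuration like this:

import org.apache.hadoop.conf.Configuration;

public class DnsConfigExample {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Hypothetical property names, for illustration only.
        int maxLookups = conf.getInt("dns.cache.max.concurrent.lookups", 10);
        int maxEntries = conf.getInt("dns.cache.max.entries", 100000);
        System.out.println("max concurrent DNS lookups: " + maxLookups);
        System.out.println("max cached entries: " + maxEntries);
    }
}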

Cheers, Walter

> Support the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch,
NUTCH-1360v3.patch, NUTCH-1360v4.patch, NUTCH-1360v5.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host to which we connect
to fetch a page.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
