nutch-dev mailing list archives

From: Ken Krugler <kkrugler_li...@transpac.com>
Subject: Re: Fetching inefficiency
Date: Tue, 22 Apr 2008 00:06:01 GMT
>Adding some comments to the email below, but here on nutch-dev. 
>Basically, it is my feeling that whenever fetchlists (and their 
>parts) are not "well balanced", this inefficiency will show up. 
>Concretely, whichever task is "stuck fetching from the slow server 
>with a lot of its pages in the fetchlist" will prolong the whole 
>fetch job. A slow server with lots of pages is a bad combination, 
>and I see that a lot.  Perhaps it's the nature of my crawl - it is 
>constrained, not web-wide, with the number of distinct hosts around 
>15-20K?

[snip]

>Some questions: Are there ways around this? Are others not seeing 
>the same behaviour? Is this just the nature of my crawl - 
>constrained and with only 15-20K unique servers?

We often ran into the same problem while doing our vertical tech 
pages crawl - a smaller number of unique hosts, and some really slow 
hosts dragging out the entire fetch cycle.

We added code that terminated slow fetches. After experimenting with 
a few different approaches, I think we settled on terminating all 
remaining fetches once the number of active fetch threads dropped 
below a threshold derived from the total number of threads 
available. The ratio was set to around 20%.

URLs that were terminated in this manner had their status set as if 
the page had returned a "temporarily unavailable" HTTP response, 
IIRC, so they would be retried in a later cycle.
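
To make that concrete, here's a minimal sketch of the monitor loop. 
All of the names here (FetchMonitor, FetchItem, markRetryLater) are 
made up for illustration - this is the idea, not our actual patch:

// Minimal sketch of the "terminate the stragglers" monitor. All
// names here (FetchMonitor, FetchItem, markRetryLater) are
// hypothetical, not the actual patch.
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class FetchMonitor implements Runnable {

    /** Hypothetical handle on an in-flight fetch. */
    public interface FetchItem {
        void markRetryLater();  // record "temporarily unavailable"
        void abort();           // interrupt the transfer
    }

    private static final double MIN_ACTIVE_RATIO = 0.2; // the ~20% threshold

    private final int totalThreads;            // fetch threads in the task
    private final AtomicInteger activeThreads; // updated by fetch threads
    private final List<FetchItem> inFlight;    // fetches still running

    public FetchMonitor(int totalThreads, AtomicInteger activeThreads,
                        List<FetchItem> inFlight) {
        this.totalThreads = totalThreads;
        this.activeThreads = activeThreads;
        this.inFlight = inFlight;
    }

    public void run() {
        while (activeThreads.get() > 0) {
            // Once most threads have finished, the few still running
            // are almost certainly stuck on slow hosts - cut them off.
            if (activeThreads.get() < totalThreads * MIN_ACTIVE_RATIO) {
                synchronized (inFlight) {
                    for (FetchItem item : inFlight) {
                        item.markRetryLater(); // retried next cycle
                        item.abort();
                    }
                }
                return; // the task can finish with what it has
            }
            try {
                Thread.sleep(1000); // poll once a second
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}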

This worked pretty well, though we had to hack the httpclient lib: 
even when you interrupted a fetch, there was cleanup code executed 
during the socket close that would try to empty the remaining 
stream, and for some slow servers that drain would itself hang the 
fetch.
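
For what it's worth, with commons-httpclient 3.x (what the 
protocol-httpclient plugin uses), the drain lives in the 
connection-release path: releaseConnection() and the response 
stream's close() try to consume whatever the server hasn't sent yet 
so the connection can be reused, while HttpMethod.abort() drops the 
connection outright and skips the drain. A rough sketch - the 
watchdog timer here is illustrative, not our actual code:

// Sketch: aborting a hung fetch with commons-httpclient 3.x instead
// of closing it. releaseConnection()/stream close() try to drain
// whatever the server hasn't sent yet, which is exactly the hang
// described above; abort() closes the socket outright.
import java.io.IOException;
import java.io.InputStream;
import java.util.Timer;
import java.util.TimerTask;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class AbortableFetch {
    public static void main(String[] args) throws IOException {
        HttpClient client = new HttpClient();
        final GetMethod get = new GetMethod("http://example.com/slow-page");

        // Hard deadline: if the fetch is still running after 60s, kill it.
        Timer watchdog = new Timer(true);
        watchdog.schedule(new TimerTask() {
            public void run() {
                get.abort(); // closes the socket; no drain, no hang
            }
        }, 60 * 1000L);

        try {
            client.executeMethod(get);
            InputStream body = get.getResponseBodyAsStream();
            byte[] buf = new byte[4096];
            while (body.read(buf) != -1) {
                // consume the page
            }
        } catch (IOException e) {
            // an aborted fetch surfaces as an IOException; treat the
            // URL as temporarily unavailable and retry it next cycle
        } finally {
            watchdog.cancel();
            get.releaseConnection(); // safe: aborted connections are discarded
        }
    }
}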

-- Ken


>If others are seeing this behaviour, then I'm wondering if others 
>have any thoughts about improving this, either before or after the 
>1.0 release?  For instance, maybe things would be better with that 
>HostDb and a Generator that knows not to produce fetchlists with 
>lots of URLs from slow servers?  Or maybe there is a way to keep 
>feeding Fetchers with URLs from other sites, so their idle threads 
>can be kept busy instead of sitting in spinWait status? (A sketch 
>of this idea follows below.)
>
>Thanks,
>Otis
>--
>Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>----- Original Message ----
>> From: "ogjunk-nutch@yahoo.com" <ogjunk-nutch@yahoo.com>
>> To: Nutch User List <nutch-user@lucene.apache.org>
>> Sent: Monday, April 21, 2008 4:16:24 PM
>> Subject: Fetching inefficiency
>>
>> Hello,
>>
>> I am wondering how others deal with the following, which I see as 
>> a fetching inefficiency:
>>
>> When fetching, the fetchlist is broken up into multiple parts, and 
>> fetchers on cluster nodes start fetching.  Some fetchers end up 
>> fetching from fast servers, and some from very, very slow servers. 
>> Those fetching from slow servers take a long time to complete and 
>> prolong the whole fetching process.  For instance, I've seen tasks 
>> from the same fetch job finish in only 1-2 hours, and others in 10 
>> hours.  Those taking 10 hours were stuck fetching pages from a 
>> single or a handful of slow sites.  If you have two nodes doing 
>> the fetching and one is stuck with a slow server, the other one is 
>> idling and wasting time.  The node stuck with the slow server is 
>> also underutilized, as it's slowly fetching from only 1 server 
>> instead of many.
>>
>> I imagine anyone using Nutch is seeing the same.  If not, what's 
>> the trick?
>>
>> I have not tried overlapping fetch jobs yet, but I have a feeling 
>> that won't help a ton, plus it could lead to two fetchers fetching 
>> from the same server and being impolite - am I wrong?
>>
>> Thanks,
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
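
To illustrate the "keep feeding idle threads" idea above: instead of 
each thread walking its slice of the fetchlist in order, the entries 
can be bucketed into per-host queues, and an idle thread pulls the 
next URL from whichever host is currently polite to hit. A rough 
sketch with made-up names (not the actual Generator/Fetcher code, 
though it's roughly the direction the Fetcher2 work has been taking 
with its per-host queues):

// Sketch of per-host queues: a thread that would otherwise spin-wait
// on one slow host pulls the next URL from any host whose politeness
// delay has elapsed. Names and structure are hypothetical.
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class HostQueues {

    private static final long CRAWL_DELAY_MS = 5000; // per-host politeness

    private final ConcurrentHashMap<String, Queue<String>> queues =
        new ConcurrentHashMap<String, Queue<String>>();
    private final Map<String, Long> nextFetchTime =
        new ConcurrentHashMap<String, Long>();

    /** Bucket a fetchlist entry by host. */
    public void add(String host, String url) {
        Queue<String> q = queues.get(host);
        if (q == null) {
            Queue<String> fresh = new ConcurrentLinkedQueue<String>();
            Queue<String> prev = queues.putIfAbsent(host, fresh);
            q = (prev == null) ? fresh : prev;
        }
        q.add(url);
    }

    /**
     * Hand out a URL from any host that is currently polite to hit,
     * or null if every host with remaining URLs is still inside its
     * crawl delay (the caller can sleep briefly and try again).
     */
    public synchronized String next() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Queue<String>> e : queues.entrySet()) {
            Long ready = nextFetchTime.get(e.getKey());
            if (ready != null && ready > now) {
                continue; // this host was hit too recently; skip, don't wait
            }
            String url = e.getValue().poll();
            if (url != null) {
                nextFetchTime.put(e.getKey(), now + CRAWL_DELAY_MS);
                return url;
            }
        }
        return null;
    }
}

A HostDb-aware Generator could go one step further and cap how many 
URLs from a known-slow host land in any single fetchlist part in the 
first place.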


-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
