nutch-dev mailing list archives

From ogjunk-nu...@yahoo.com
Subject Re: Fetching inefficiency
Date Mon, 21 Apr 2008 20:40:04 GMT
Adding some comments to the email below, but here on nutch-dev.

Basically, my feeling is that this inefficiency will show up whenever fetchlists (and their
parts) are not "well balanced".
Concretely, whichever task is stuck fetching from a slow server that has a lot of pages in
its fetchlist part will prolong the whole fetch job.  A slow server with lots of pages is
a bad combination, and I see that a lot.  Perhaps it's the nature of my crawl: it is constrained,
not web-wide, with around 15-20K distinct hosts?

* Example fetchlist part:
slow.com/1
fast.com/1
ok.com/1
slow.com/2
fast.com/2
ok.com/2
ok.com/3
slow.com/3
slow.com/4
slow.com/5
slow.com/6

* The above fetchlist part will take a lot longer than this one:
speedy.com/1
speedy.com/2
speedy.com/3
speedy.com/4
superspeedy.com/1
ok2.com/1
ok2.com/2
speedy.com/5
speedy.com/6
speedy.com/7
ok2.com/3
speedy.com/8

The task processing the first set of URLs will be slower because it got the slow slow.com
server and slow.com happens to have a lot of pages in that fetchlist part.  The task processing
the second set of URLs will be quick, since all its servers are pretty fast.
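
To put rough numbers on the two parts above, here is a minimal sketch in plain Java (not Nutch code). It assumes each host is fetched politely, one page at a time, and that there are enough threads for all hosts in a part to proceed in parallel, so a part needs at least as long as its slowest host needs for all of its pages. The per-page times are made-up illustrative values, and FetchlistEstimate and its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FetchlistEstimate {

    /** Counts URLs per host; the host is everything before the first '/'. */
    public static Map<String, Integer> pagesPerHost(List<String> urls) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String url : urls) {
            int slash = url.indexOf('/');
            String host = slash < 0 ? url : url.substring(0, slash);
            counts.merge(host, 1, Integer::sum);
        }
        return counts;
    }

    /**
     * Lower bound on the time for one fetchlist part, assuming one
     * sequential connection per host and all hosts fetched in parallel:
     * the maximum over hosts of (pages on host * seconds per page).
     * Hosts missing from secondsPerPage use defaultSeconds.
     */
    public static long estimateSeconds(List<String> urls,
                                       Map<String, Integer> secondsPerPage,
                                       int defaultSeconds) {
        long worst = 0;
        for (Map.Entry<String, Integer> e : pagesPerHost(urls).entrySet()) {
            long perPage = secondsPerPage.getOrDefault(e.getKey(), defaultSeconds);
            worst = Math.max(worst, e.getValue() * perPage);
        }
        return worst;
    }

    public static void main(String[] args) {
        // Hypothetical per-page fetch times in seconds (assumptions, not measurements).
        Map<String, Integer> secondsPerPage = Map.of(
                "slow.com", 30, "ok.com", 5, "fast.com", 1,
                "speedy.com", 1, "superspeedy.com", 1, "ok2.com", 5);

        List<String> first = List.of(
                "slow.com/1", "fast.com/1", "ok.com/1", "slow.com/2",
                "fast.com/2", "ok.com/2", "ok.com/3", "slow.com/3",
                "slow.com/4", "slow.com/5", "slow.com/6");
        List<String> second = List.of(
                "speedy.com/1", "speedy.com/2", "speedy.com/3", "speedy.com/4",
                "superspeedy.com/1", "ok2.com/1", "ok2.com/2", "speedy.com/5",
                "speedy.com/6", "speedy.com/7", "ok2.com/3", "speedy.com/8");

        System.out.println("first part:  " + estimateSeconds(first, secondsPerPage, 5) + " s");
        System.out.println("second part: " + estimateSeconds(second, secondsPerPage, 5) + " s");
    }
}
```

With these assumed times, the first part is bounded by slow.com (6 pages at 30 s each, so 180 s), while the second is bounded by ok2.com (3 pages at 5 s each, so 15 s), even though the second part has more URLs.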

Some questions:
Are there ways around this?
Are others not seeing the same behaviour?
Is this just the nature of my crawl - constrained and with only 15-20K unique servers?

If others are seeing this behaviour, then I'm wondering if anyone has thoughts about
improving this, either before or after the 1.0 release?  For instance, maybe things would be
better with that HostDb and a Generator that knows not to produce fetchlists with lots of
URLs from slow servers?  Or maybe there is a way to keep feeding Fetchers with URLs from other
sites, so their idle threads can be kept busy instead of sitting in spinWait status?
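
To make the Generator idea a bit more concrete, here is a minimal sketch in plain Java of capping how many URLs from known-slow hosts go into one fetchlist. This is not the Nutch Generator or HostDb API: SlowHostCap, capSlowHosts and the set of slow hosts are all hypothetical, and the sketch simply assumes that something like a HostDb could supply that set.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SlowHostCap {

    /** The host is everything before the first '/'. */
    public static String hostOf(String url) {
        int slash = url.indexOf('/');
        return slash < 0 ? url : url.substring(0, slash);
    }

    /**
     * Keeps candidate URLs in their original order, but admits at most
     * maxPerSlowHost URLs from each host listed in slowHosts. URLs that
     * are skipped are not lost; the assumption is that they stay
     * eligible for a later fetchlist.
     */
    public static List<String> capSlowHosts(List<String> candidates,
                                            Set<String> slowHosts,
                                            int maxPerSlowHost) {
        Map<String, Integer> taken = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (String url : candidates) {
            String host = hostOf(url);
            if (slowHosts.contains(host)) {
                int alreadyTaken = taken.getOrDefault(host, 0);
                if (alreadyTaken >= maxPerSlowHost) {
                    continue; // defer this URL to a later fetchlist
                }
                taken.put(host, alreadyTaken + 1);
            }
            selected.add(url);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> candidates = List.of(
                "slow.com/1", "fast.com/1", "ok.com/1", "slow.com/2",
                "fast.com/2", "ok.com/2", "ok.com/3", "slow.com/3",
                "slow.com/4", "slow.com/5", "slow.com/6");
        // Assumption: slow.com has been flagged as slow by some host statistics.
        System.out.println(capSlowHosts(candidates, Set.of("slow.com"), 2));
    }
}
```

Applied to the first example part with a cap of 2, only slow.com/1 and slow.com/2 would be kept alongside the fast.com and ok.com URLs. The trade-off is that slow hosts then take more fetch cycles to cover fully.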

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: "ogjunk-nutch@yahoo.com" <ogjunk-nutch@yahoo.com>
> To: Nutch User List <nutch-user@lucene.apache.org>
> Sent: Monday, April 21, 2008 4:16:24 PM
> Subject: Fetching inefficiency
> 
> Hello,
> 
> I am wondering how others deal with the following, which I see as fetching 
> inefficiency:
> 
> 
> When fetching, the fetchlist is broken up into multiple parts and fetchers on 
> cluster nodes start fetching.  Some fetchers end up fetching from fast servers, 
> and some from very very slow servers.  Those fetching from slow servers take a 
> long time to complete and prolong the whole fetching process.  For instance, 
> I've seen tasks from the same fetch job finish in only 1-2 hours, and others in 
> 10 hours.  Those taking 10 hours were stuck fetching pages from a single or 
> handful of slow sites.  If you have two nodes doing the fetching and one is 
> stuck with a slow server, the other one is idling and wasting time.  The node 
> stuck with the slow server is also underutilized, as it's slowly fetching from 
> only 1 server instead of many.
> 
> I imagine anyone using Nutch is seeing the same.  If not, what's the trick?
> 
> I have not tried overlapping fetching jobs yet, but I have a feeling that won't 
> help a ton, plus it could lead to two fetchers fetching from the same server and 
> being impolite - am I wrong?
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

