incubator-droids-dev mailing list archives

From "paul.vc" <p...@paul.vc>
Subject Re: Droids suitability for a 100M+ page crawl
Date Fri, 25 Mar 2011 21:09:47 GMT
On Fri, 25 Mar 2011 13:08:40 -0700 (PDT), Otis Gospodnetic
<otis_gospodnetic@yahoo.com> wrote:
> Hi,
> 
> Somebody (Paul?) mentioned using Droids for doing a 50M page crawl.
> Anyone else using Droids for crawls of that size?

Yes, but a "little" bit more :) I had a seed of 60M hosts and crawled about
2 billion pages on a 16-node cluster. The crawl took about 3 weeks with an
average bandwidth of 35 Mbit/s per node.
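
As a rough sanity check on these figures, the per-node page rate and the average transfer per page can be derived from them. This is only an illustrative calculation: the class and method names are made up, the inputs are the approximate numbers quoted above, and it assumes all of the bandwidth went into page transfers (so the per-page size is an upper bound that includes protocol overhead).

```java
class CrawlThroughput {

    /** Pages fetched per second on each node, assuming an even spread across nodes. */
    static double pagesPerSecondPerNode(double totalPages, int nodes, double days) {
        return totalPages / nodes / (days * 86_400);
    }

    /** Average kilobytes transferred per page at the given per-node bandwidth. */
    static double avgKilobytesPerPage(double mbitPerSecond, double pagesPerSecond) {
        double bytesPerSecond = mbitPerSecond * 1_000_000 / 8;
        return bytesPerSecond / 1_000 / pagesPerSecond;
    }
}
```

With 2 billion pages, 16 nodes and 21 days, this works out to roughly 69 pages per second per node and a little over 60 KB transferred per page.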
 
> I'm asking because I have a need to do a "semi-vertical" crawl on up to 10K
> domains and I'm considering Droids vs. Nutch.  This may translate to several
> times that many different servers - say 100K.  And that may translate to a
> few 100M web pages.  Too big for Droids without having a persistent link
> queue, right?


I had no more than 128 Droid threads running per machine, with a total
in-memory queue limit (for all threads per node) of ~500,000 pages. I had a
couple of tweaks here and there, plus an efficient tree structure for
storing visited URLs. With 10 GB you could easily go up to 2 million entries
when running only one droid thread per VM, perhaps even more. Here is
something pretty to look at: http://twitpic.com/4d87i7 ;) My crawler design
was rather simple: ((one host - one droid, stay on the host) * 128 threads)
* 16 nodes, plus one master node for the global seed queue. If your crawl
needs to behave more organically and may be executed in a less controlled
environment, take a look at 80Legs.
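
The "one host - one droid, stay on the host" layout on a single node could be sketched roughly as below. This is a minimal illustration using only the JDK, not the actual crawler code and not the Droids API: `HostDroidSketch`, `crawlHost`, `crawlNode` and the `fetchLinks` callback are hypothetical names, a plain `HashSet` stands in for the tree structure mentioned above, and the queue limit is applied per host rather than across all threads on the node.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class HostDroidSketch {

    /** Returns the host part of a URL, or null if it cannot be parsed. */
    static String hostOf(String url) {
        try {
            return URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    /** One "droid": crawls a single host and never follows links off it. */
    static Set<String> crawlHost(String seedUrl, int maxQueued,
                                 Function<String, List<String>> fetchLinks) {
        String host = hostOf(seedUrl);
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        if (host == null) return seen;
        seen.add(seedUrl);
        queue.add(seedUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            for (String link : fetchLinks.apply(url)) {
                if (!host.equalsIgnoreCase(hostOf(link))) continue; // stay on the host
                if (queue.size() >= maxQueued) continue;            // queue full: drop the link
                if (seen.add(link)) queue.add(link);                // enqueue unseen URLs only
            }
        }
        return seen;
    }

    /** One node: a fixed pool of droid threads, each working on one host at a time. */
    static List<Set<String>> crawlNode(List<String> seeds, int threads, int maxQueued,
                                       Function<String, List<String>> fetchLinks)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Set<String>>> futures = new ArrayList<>();
            for (String seed : seeds) {
                Callable<Set<String>> droid = () -> crawlHost(seed, maxQueued, fetchLinks);
                futures.add(pool.submit(droid));
            }
            List<Set<String>> results = new ArrayList<>();
            for (Future<Set<String>> f : futures) results.add(f.get());
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch a node would call something like `crawlNode(seedsFromMaster, 128, maxQueued, fetcher)`, where the seed list comes from the master's global queue and `fetcher` downloads a page and returns its outgoing links.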

If you choose to go with Droids, you will have to spend some time on the
HTTP client settings and implement some workarounds in Droids / HTTP client
to prevent stale/stuck sockets, handle chunked stream aborts correctly
(crawler traps serving you an endless stream of links, for instance), and
take care of a rather critical robots issue. The NekoHTML parser is also
full of surprises, so be careful with that, at least if you are lazy like me
and use the DOM model instead of SAX to pre-parse some meta tags and do
content processing.
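
One way to guard against an endless chunked response is to cap both the number of bytes and the time spent reading a body. The following is a minimal sketch using only JDK streams, not the workaround actually used in the crawl: `BoundedBodyReader` and `readBounded` are hypothetical names, and the limits are arbitrary. The deadline is only checked between reads, so a socket read timeout on the HTTP client is still needed to break out of a read that blocks forever.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class BoundedBodyReader {

    /**
     * Reads at most maxBytes from the stream and gives up after maxMillis.
     * A body longer than maxBytes is silently truncated.
     */
    static byte[] readBounded(InputStream in, int maxBytes, long maxMillis) {
        long deadline = System.nanoTime() + maxMillis * 1_000_000L;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try {
            while (out.size() < maxBytes) {
                if (System.nanoTime() - deadline > 0) {
                    throw new IOException("body read exceeded " + maxMillis + " ms");
                }
                int n = in.read(buf, 0, Math.min(buf.length, maxBytes - out.size()));
                if (n == -1) break;   // end of stream: body is complete
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Truncating rather than failing is a design choice here: for link extraction, the first part of an oversized page is usually still useful, while the cap stops a trap from holding a thread indefinitely.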

I don't see how my spare time would allow me to clean up my battlefield and
release all of it back to the community any time soon, but I am not sitting
on that code either. Talk to me if you want to port aspects of it back to
Droids. I would be more than happy to rip out chunks of that code and pass
them along for proper integration into the main branch.

Regards,
Paul.
