nutch-dev mailing list archives

From: Ken van Mulder <>
Subject: Re: nutch questions
Date: Fri, 09 Dec 2005 16:06:45 GMT
Thanks Stefan. I'll resend this to the user list as well. Just thought 
the dev list might be better since we're using the map/reduce version.
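
For anyone else running these numbers, here is a quick back-of-envelope 
sketch in Python based on Stefan's figures below. The storage and 
queries-per-second constants are his rough estimates, not measured 
values, and the shard/replica split is my own assumption about how 
query fan-out scales, so treat the result as order-of-magnitude only:

    # Back-of-envelope sizing from the rules of thumb in this thread:
    #   ~1 TB of total storage per 100 million pages
    #   one box with 4 GB of RAM serves ~2 million pages at ~20 qps
    def estimate(pages, target_qps):
        storage_tb = pages / 100e6          # Stefan's 1 TB per 100M pages
        shards = -(-pages // 2_000_000)     # ceiling: ~2M pages per box
        # Assumption (mine, not Stefan's): every query fans out to all
        # shards, so one full replica tops out near 20 qps; add replicas
        # to scale query throughput.
        replicas = -(-target_qps // 20)
        return {"storage_tb": storage_tb,
                "shards": shards,
                "replicas": replicas,
                "total_boxes": shards * replicas}

    # Example: a 2-billion-page index serving 100 queries per second
    # -> ~20 TB storage, 1000 shards x 5 replicas = 5000 boxes.
    print(estimate(2_000_000_000, 100))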


Stefan Groschupf wrote:
> Ken,
> maybe the user mailing list would be a better place for such questions.
> The size of your index depends on your configuration (which index
> filter plugins you use).
> Figure that a document in the index needs about 10 KB, plus metadata
> such as date, content type, or category of the page.
> Storing the page content takes around 64 KB per page.
> You also need to store a link graph and a list of known URLs (the web db).
> I would say each 100 million documents require about 1 TB of storage.
> Information about query speed can be found in the list archives; as a
> rule of thumb, a box with 4 GB of RAM can handle 20 queries per second
> against 2 million documents.
> So in general you need many boxes, but the more expensive part of such
> a project is bandwidth.
> Nutch 0.8 works well, however you have to write some custom jobs to
> get some standard jobs done. Also, storing the index on the
> distributed filesystem and searching it from there is very, very slow.
> Besides that, Nutch has serious problems with spam detection in very
> large indexes.
> Stefan
> On 09.12.2005 at 00:59, Ken van Mulder wrote:
>> Hey folks,
>> We're looking at launching a search engine in the beginning of the
>> new year that will eventually grow into a multi-billion page index.
>> Three questions:
>> First, and most important for now, does anyone have any useful
>> numbers for what the hardware requirements are to run such an engine?
>> I have numbers for how fast I can get the crawlers working, but not
>> for how many pages can be served off of each search node or how much
>> processing power is required for the indexing, etc.
>> Second, what still needs to be done to Nutch in order for it to be
>> able to handle billions of pages? Is there a general list of
>> requirements?
>> Third, if Nutch isn't capable of doing what we need, what is the
>> expected upper limit for it, using the map/reduce version?
>> Thanks,
>> -- 
>> Ken van Mulder
>> Wavefire Technologies Corporation
>> 250.717.0200 (ext 113)

Ken van Mulder
Wavefire Technologies Corporation
250.717.0200 (ext 113)
