nutch-dev mailing list archives

From Stefan Groschupf
Subject Re: nutch questions
Date Fri, 09 Dec 2005 10:27:58 GMT
Maybe the user mailing list would be a better place for such questions.
The size of your index depends on your configuration (what kind of
index filter plugins you use).

As a rough figure, a document in the index needs 10 KB, plus the metadata
such as date, content type, or category of the page.
Storing the page content took around 64 KB per page.
You also need to store a link graph and a list of known URLs (the web db).
I would say every 100 million documents require 1 TB of storage.
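
The per-document figures above can be turned into a quick estimate. This is only a sketch: it assumes decimal units (1 KB = 1000 bytes), the function and constant names are made up for illustration, and it leaves out the metadata, the link graph, and the web db because no sizes are given for them. Under these assumptions the 10 KB index figure alone works out to 1 TB per 100 million documents, and stored page content would come on top of that.

```python
# Back-of-the-envelope storage estimate from the per-document figures above.
# Assumption: decimal units (1 KB = 1000 bytes, 1 TB = 10**12 bytes).
# Metadata, link graph, and web db are not included (no sizes were given).

KB = 1000
TB = 10**12

INDEX_BYTES_PER_DOC = 10 * KB    # index entry, excluding metadata
CONTENT_BYTES_PER_DOC = 64 * KB  # stored page content


def storage_tb(num_docs, bytes_per_doc):
    """Return the storage needed for num_docs documents, in terabytes."""
    return num_docs * bytes_per_doc / TB


docs = 100_000_000
index_tb = storage_tb(docs, INDEX_BYTES_PER_DOC)
content_tb = storage_tb(docs, CONTENT_BYTES_PER_DOC)
print(f"index:   {index_tb:.1f} TB")
print(f"content: {content_tb:.1f} TB")
```

Changing `docs` to the planned index size gives a first guess at disk needs; real numbers will depend on the index filter plugins in use.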

Information about query speed can be found in the index. As a rule of
thumb, a box with 4 GB of RAM can handle 20 queries per second with
2 million documents per box.
So in general you need many boxes, but the more expensive part of
such a project is bandwidth.
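
To see what "many boxes" might mean, the rule of thumb can be plugged into a small calculation. This is a rough sketch under assumptions that are not from the message itself: every query is sent to every index shard, so one full set of boxes sustains about 20 queries per second, and higher query rates are reached by replicating the whole set. The target of 100 queries per second and all names in the code are hypothetical.

```python
import math

DOCS_PER_BOX = 2_000_000  # rule of thumb: documents per 4 GB box
QPS_PER_BOX_SET = 20      # assumed: one full set of boxes sustains 20 queries/s


def search_boxes(total_docs, target_qps):
    """Estimate search boxes as index shards times replicas of the shard set."""
    shards = math.ceil(total_docs / DOCS_PER_BOX)
    replicas = math.ceil(target_qps / QPS_PER_BOX_SET)
    return shards, replicas, shards * replicas


# Hypothetical target: 1 billion pages served at 100 queries per second.
shards, replicas, boxes = search_boxes(1_000_000_000, 100)
print(f"{shards} shards x {replicas} replicas = {boxes} boxes")
```

The estimate covers search nodes only; crawling, indexing, and bandwidth are separate costs.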

Nutch 0.8 works well; however, you have to write some custom jobs to
get some standard jobs done. Also, storing the index on the distributed
filesystem and searching it from there is very slow. Besides that,
Nutch has serious problems with spam detection in very large indexes.


On 09.12.2005 at 00:59, Ken van Mulder wrote:

> Hey folks,
> We're looking at launching a search engine in the beginning of the  
> new year that will eventually grow to being a multi-billion page  
> index. Three questions:
> First, and most important for now, does anyone have any useful  
> numbers for what the hardware requirements are to run such an  
> engine? I have numbers for how fast I can get the crawlers
> working. But not for how many pages can be served off of each  
> search node and how much processing power is required for the  
> indexing, etc.
> Second, what all needs to be done to Nutch yet in order for it to  
> be able to handle billions of pages? Is there a general list of  
> requirements?
> Third, if nutch isn't capable of doing what we need, what is the  
> expected upper limit for it? Using the map/reduce version.
> Thanks,
> -- 
> Ken van Mulder
> Wavefire Technologies Corporation
> 250.717.0200 (ext 113)
