nutch-dev mailing list archives

From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: nutch/lucene question...
Date Fri, 25 Aug 2006 21:15:52 GMT
bruce wrote:
> hi...
>
> if it's ok, i've got some basic research questions.
>
> can someone tell me if there's a limit to the number of simultaneous
> websites that nutch/lucene can return...?
>   
I assume you are asking about its indexing capacity. If that is the case,
it is in the billions; it is limited pretty much only by hardware and
bandwidth.
> i'm assuming the nutch/lucene writes the text information from the crawl
> back to a db. can someone tell me if there's a limit to the number of pages
> that can be written to the db in a simultaneous manner...
>   
The crawl process runs over a cluster of machines in parallel. Each
fetcher grabs web pages in parallel with the others, and those pages are
then reduced into a number of binary files called the crawl database.
This is not a SQL database. While the fetching can be massively parallel
and is again limited only by hardware, writing the results to the crawl
database usually happens on a single machine in a single job and is
serial in nature.
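To make the shape of that concrete, here is a toy Python sketch of the idea, not Nutch code: the fetch step runs in parallel, while the fold into one crawl-database structure is a single serial step. All the names (fetch, update_crawldb) are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a real HTTP fetch; returns (url, page_content).
    return url, "content of " + url

def update_crawldb(crawldb, fetched):
    # Serial reduce: fold every fetched page into one sorted map,
    # loosely analogous to Nutch's binary crawl-database files.
    for url, content in fetched:
        crawldb[url] = content
    return dict(sorted(crawldb.items()))

urls = ["http://a.example", "http://b.example", "http://c.example"]

# Parallel part: many fetchers running at once.
with ThreadPoolExecutor(max_workers=3) as pool:
    fetched = list(pool.map(fetch, urls))

# Serial part: one process writes the crawl database.
crawldb = update_crawldb({}, fetched)
```

The point is only the division of labor: the parallel stage scales with hardware, while the reduce stage is a single-writer bottleneck by design.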
> from what i've seen, you can setup nutch/lucene to use multiple servers to
> do the search. how do these child servers go about adding their information
> from the crawl to the overall db....
>
>   
Once the pages are fetched, they are processed for links and content and
then go through an indexing process that creates binary index files in
the Lucene format. This usually happens on a distributed file system.
Those index files are then moved to the local file system for searching.
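At its core an index like this is an inverted index: a map from each term to the set of documents containing it. The following toy Python sketch shows that structure only; real Nutch indexing writes Lucene-format files through a MapReduce job, not code like this.

```python
from collections import defaultdict

def build_index(pages):
    # Inverted index: term -> set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

pages = {
    "doc1": "nutch crawls the web",
    "doc2": "lucene indexes the web",
}
index = build_index(pages)
# A term shared by both pages maps to both document ids.
```

Searching then becomes set lookups and intersections over this map, which is why the finished index files can simply be read from disk without further writes.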

You would use multiple search servers to increase search capacity, but
those search servers don't alter the indexes. Creation and manipulation
of the indexes happen as batch jobs using map-reduce processes well
before the indexes are ever searched. Once created, the indexes are
usually not changed, just continually searched (meaning read from disk).
Multiple search servers aggregate their search results before the
results are returned to the user.
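The aggregation step can be sketched as a merge of per-server hit lists by score. This is an illustration of the idea only; the function and data shapes below are invented, not the Nutch search API.

```python
import heapq

def merge_results(per_server_hits, k):
    # Each hit is a (score, doc_id) pair; higher score is better.
    # Merge the servers' already-ranked lists and keep the top k.
    ranked = [sorted(hits, reverse=True) for hits in per_server_hits]
    merged = heapq.merge(*ranked, reverse=True)
    return list(merged)[:k]

server1 = [(0.9, "doc1"), (0.4, "doc7")]
server2 = [(0.8, "doc3"), (0.2, "doc9")]
top = merge_results([server1, server2], 3)
# top == [(0.9, "doc1"), (0.8, "doc3"), (0.4, "doc7")]
```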
> thanks
>
> -bruce
>
>   
