nutch-dev mailing list archives

From "Daniel D." <nutchfo...@gmail.com>
Subject Analyze command purpose ....
Date Thu, 16 Jun 2005 15:06:30 GMT
Dear Nutch Developers,

I have not received any answers to the questions below, so I am posting them 
again.

----------- Question # 1 ------------------------
As I understand it, the Nutch crawler uses a crawl-and-stop strategy, with the 
threshold set by the -topN parameter. Please correct me if I'm wrong. This also 
means that some sites will be crawled to a different depth than others.

Is there a way to control the crawl depth per domain and the number of URLs 
per domain, as well as the total number of domains crawled (in this case it's 
-topN)?

----------- Question # 2 ------------------------
The whole-web crawling tutorial advises using the following command 
sequence:

fetch
updatedb db
generate db segments -topN 1000

Use of the -topN parameter implies that updatedb db does some analysis of the 
fetched data. The analyze command (net.nutch.tools.LinkAnalysisTool) is not 
mentioned in the tutorial. The DissectingTheNutchCrawler article 
(http://wiki.apache.org/nutch/DissectingTheNutchCrawler) includes this command 
in the command sequence for whole-internet crawling.

When should I use the analyze command, and when can I skip it?

I'm trying to get a sense of how much storage (disk space and RAM) the WebDB 
will require, and now I am also concerned about how many machine resources 
analyze will consume. Nobody has provided this information yet. I would 
appreciate it if somebody could share their knowledge and thoughts here.

I'm looking for something like: for 1,000,000 documents, the WebDB will take 
approximately XX GB, and running bin/nutch updatedb on 1,000,000 documents 
will use up to XX MB of RAM.


----------- Question # 3 ------------------------


After the initial inject and the subsequent fetch and updatedb commands, can I 
use inject to add more URLs to the WebDB?

I would greatly appreciate your help.

Thanks,
Daniel