nutch-dev mailing list archives

From AJ Chen <anjun.c...@sbcglobal.net>
Subject Re: "db.max.outlinks.per.page" is misunderstood?
Date Wed, 07 Sep 2005 17:10:46 GMT
My understanding is that, when the web db is updated, only up to the 
maximum number of outlinks are processed for each page. Assuming the 
same page is not fetched and processed again in later fetch/update 
cycles, the outlinks beyond that maximum will never be picked up, no 
matter how many cycles you run.
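
To make the two readings concrete, here is a minimal Python sketch. It 
is a hypothetical model, not Nutch code: the function names are made up 
for illustration, and the outlink counts are the ones from Jack's 
example quoted below. It compares a cap applied to each page 
independently (my reading) with a cap shared across one fetch phase 
(Jack's proposed reading).

```python
# Hypothetical model of the two interpretations; this is not Nutch code.
MAX_OUTLINKS_PER_PAGE = 100  # default value of db.max.outlinks.per.page

# Outlink counts per seed page, taken from Jack's example.
pages = {
    "http://www.a.com/index.php": 90,
    "http://www.b.com/index.jsp": 80,
    "http://www.c.com/index.html": 50,
}


def per_page_cap(outlink_counts, limit):
    """Apply the limit to each page independently."""
    return {url: min(count, limit) for url, count in outlink_counts.items()}


def per_phase_cap(outlink_counts, limit):
    """Share a single limit across all pages handled in one phase."""
    remaining = limit
    processed = {}
    for url, count in outlink_counts.items():
        taken = min(count, remaining)
        processed[url] = taken
        remaining -= taken
    return processed


if __name__ == "__main__":
    print("per page: ", per_page_cap(pages, MAX_OUTLINKS_PER_PAGE))
    print("per phase:", per_phase_cap(pages, MAX_OUTLINKS_PER_PAGE))
```

Under the per-page reading none of the three pages is truncated, because 
each has fewer than 100 outlinks. Under the per-phase reading only 100 
of the 220 outlinks would be processed, which is where Jack's 
"(80 - 10) + 50" remainder comes from.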

To make sure all of the outlinks on a page are processed, 
db.max.outlinks.per.page must be set to a number that is larger than the 
number of outlinks on that page. If this is true, then the maximum would 
have to be determined at run time, since the number of outlinks varies 
from page to page.

Is my understanding correct?

AJ


Jack Tang wrote:

>Hi All
>
>Here is the "db.max.outlinks.per.page" property and its description in
>nutch-default.xml
>	<property>
>	  <name>db.max.outlinks.per.page</name>
>	  <value>100</value>
>	  <description>The maximum number of outlinks that we'll process for a page.
>	  </description>
>	</property>
>
>I don't think the description is right.
>Say, my crawler feeds are:
>http://www.a.com/index.php (90 outlinks)
>http://www.b.com/index.jsp  (80 outlinks)
>http://www.c.com/index.html (50 outlinks)
>
>and the number of crawler threads is 30. Do you think the remaining URLs
>((80 - 10) outlinks + 50 outlinks) will be fetched?
>
>I think the description should be "The maximum number of outlinks in
>one fetching phase."
>
>
>Regards
>/Jack
>  
>

-- 
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting 
Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, anjun.chen@sbcglobal.net
---------------------------------------------------
