nutch-dev mailing list archives

From Jack Tang <him...@gmail.com>
Subject Re: "db.max.outlinks.per.page" is misunderstood?
Date Wed, 07 Sep 2005 17:28:21 GMT
Hi Chen

I don't think it is a limit on ONE page, but on ONE fetching phase (cycle).
In my previous example, 

feed urls:
http://www.a.com/index.php (90 outlinks)
http://www.b.com/index.jsp  (80 outlinks)
http://www.c.com/index.html (50 outlinks)
90 + 80 + 50 = 220 outlinks, and they are all different. I used the
protocol-httpclient plugin.
In one fetching cycle, if the total number of fetched outlinks reaches 100,
then the others will be abandoned. Right?
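
To make the two readings concrete, here is a minimal Java sketch. It is not Nutch code: the class and method names (OutlinkLimitSketch, keptPerPage, keptPerCycle) are made up for illustration, and it only counts how many outlinks would survive under each interpretation, using the three feed URLs above and a limit of 100.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Nutch.
final class OutlinkLimitSketch {

    // Reading 1 (the nutch-default.xml description): the limit is applied
    // to each page separately, so every page keeps at most maxOutlinks.
    static int keptPerPage(Map<String, Integer> outlinksByPage, int maxOutlinks) {
        int kept = 0;
        for (int count : outlinksByPage.values()) {
            kept += Math.min(count, maxOutlinks);
        }
        return kept;
    }

    // Reading 2 (the per-cycle interpretation): the limit is applied to the
    // sum over all pages in one fetching cycle.
    static int keptPerCycle(Map<String, Integer> outlinksByPage, int maxOutlinks) {
        int total = 0;
        for (int count : outlinksByPage.values()) {
            total += count;
        }
        return Math.min(total, maxOutlinks);
    }

    public static void main(String[] args) {
        Map<String, Integer> feeds = new LinkedHashMap<>();
        feeds.put("http://www.a.com/index.php", 90);
        feeds.put("http://www.b.com/index.jsp", 80);
        feeds.put("http://www.c.com/index.html", 50);

        int limit = 100; // value of db.max.outlinks.per.page in the example

        System.out.println("per page:  " + keptPerPage(feeds, limit));  // 220
        System.out.println("per cycle: " + keptPerCycle(feeds, limit)); // 100
    }
}
```

Under the per-page reading none of the three pages exceeds the limit, so all 220 outlinks are kept; under the per-cycle reading only 100 would be. Which of the two Nutch actually does is exactly the open question here.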

/Jack

On 9/8/05, AJ Chen <anjun.chen@sbcglobal.net> wrote:
> My understanding is that only up to the maximum number of outlinks are
> processed for a page when updating the web db. I assume the same page
> won't get fetched and processed again in the next fetch/update cycles,
> so you won't get those outlinks exceeding the maximum number no matter
> how many cycles you run.
> 
> To make sure all of the outlinks are processed for a page, the
> db.max.outlinks.per.page must be set to a number that is larger than the
> number of outlinks on the page. If this is true, then the max number
> has to be determined in real time since the number of outlinks varies
> from page to page.
> 
> Is my understanding correct?
> 
> AJ
> 
> 
> Jack Tang wrote:
> 
> >Hi All
> >
> >Here is the "db.max.outlinks.per.page" property and its description in
> >nutch-default.xml
> >       <property>
> >         <name>db.max.outlinks.per.page</name>
> >         <value>100</value>
> >         <description>The maximum number of outlinks that we'll process for a page.
> >         </description>
> >       </property>
> >
> >I don't think the description is right.
> >Say, my crawler feeds are:
> >http://www.a.com/index.php (90 outlinks)
> >http://www.b.com/index.jsp  (80 outlinks)
> >http://www.c.com/index.html (50 outlinks)
> >
> >and the number of crawler threads is 30. Do you think the remaining URLs
> >((80 - 10) outlinks + 50 outlinks) will be fetched?
> >
> >I think the description should be "The maximum number of outlinks in
> >one fetching phase."
> >
> >
> >Regards
> >/Jack
> >
> >
> 
> --
> AJ (Anjun) Chen, Ph.D.
> Canova Bioconsulting
> Marketing * BD * Software Development
> 748 Matadero Ave., Palo Alto, CA 94306, USA
> Cell 650-283-4091, anjun.chen@sbcglobal.net
> ---------------------------------------------------
> 
> 


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars
