nutch-dev mailing list archives

From AJ Chen <anjun.c...@sbcglobal.net>
Subject Re: "db.max.outlinks.per.page" is misunderstood?
Date Wed, 07 Sep 2005 17:40:34 GMT
Jack,
Set the max to 100, but run 10 cycles (i.e., depth=10) with the 
CrawlTool. You may find that all the outlinks have been collected by 
the end. 3 cycles is usually not enough.
-AJ
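
The per-page reading of "db.max.outlinks.per.page" discussed in this thread can be sketched as follows. This is only an illustration, not Nutch's actual code: the function name kept_outlinks, the constant, and the assumption that the first N links are the ones kept are all hypothetical. The outlink counts are taken from Jack's example below.

```python
# Hypothetical sketch of a per-page outlink cap; not taken from the Nutch source.
MAX_OUTLINKS_PER_PAGE = 100  # mirrors the default value of db.max.outlinks.per.page


def kept_outlinks(outlinks, limit=MAX_OUTLINKS_PER_PAGE):
    """Return the outlinks retained for ONE page under a per-page cap.

    Assumption: the first `limit` links are kept and the rest are dropped.
    """
    return outlinks[:limit]


# Outlink counts from the example in this thread.
pages = {
    "http://www.a.com/index.php": 90,
    "http://www.b.com/index.jsp": 80,
    "http://www.c.com/index.html": 50,
}

for url, count in pages.items():
    # Placeholder link names, generated only for the illustration.
    links = [f"{url}#link{i}" for i in range(count)]
    kept = kept_outlinks(links)
    print(f"{url}: found {len(links)}, kept {len(kept)}")
```

Under this reading the cap applies to each page separately, so none of the three example pages loses links at a limit of 100. Only a page with more than 100 outlinks would be truncated, which would be consistent with a larger value such as 3000 recovering the missing URLs.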

Jack Tang wrote:

>Yes, Stefan.
>But it missed some URLs. After I set the value to 3000, everything was OK.
>
>/Jack
>
>On 9/8/05, Stefan Groschupf <sg@media-style.com> wrote:
>  
>
>>Jack,
>>That is max outlinks per html page.
>>All your example pages have fewer than 100 outlinks, right?
>>Stefan
>>
>>On 07.09.2005 at 18:43, Jack Tang wrote:
>>
>>    
>>
>>>Hi All
>>>
>>>Here is the "db.max.outlinks.per.page" property and its description in
>>>nutch-default.xml
>>>    <property>
>>>      <name>db.max.outlinks.per.page</name>
>>>      <value>100</value>
>>>      <description>The maximum number of outlinks that we'll
>>>      process for a page.
>>>      </description>
>>>    </property>
>>>
>>>I don't think the description is right.
>>>Say, my crawler feeds are:
>>>http://www.a.com/index.php (90 outlinks)
>>>http://www.b.com/index.jsp  (80 outlinks)
>>>http://www.c.com/index.html (50 outlinks)
>>>
>>>and the number of crawler threads is 30. Do you think the remaining URLs
>>>((80 - 10) outlinks + 50 outlinks) will be fetched?
>>>
>>>I think the description should be "The maximum number of outlinks in
>>>one fetching phase."
>>>
>>>
>>>Regards
>>>/Jack
>>>--
>>>Keep Discovering ... ...
>>>http://www.jroller.com/page/jmars
>>>
>>>
>>>      
>>>
>>---------------------------------------------------------------
>>company:        http://www.media-style.com
>>forum:        http://www.text-mining.org
>>blog:            http://www.find23.net

-- 
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting 
Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, anjun.chen@sbcglobal.net
---------------------------------------------------
