nutch-dev mailing list archives

From Jack Tang <him...@gmail.com>
Subject Re: "db.max.outlinks.per.page" is misunderstood?
Date Wed, 07 Sep 2005 17:44:38 GMT
Thanks Chen, I will try that:)

On 9/8/05, AJ Chen <anjun.chen@sbcglobal.net> wrote:
> Jack,
> Set the max to 100, but run 10 cycles (i.e., depth=10) with the
> CrawlTool. You may see all the outlinks are collected toward the end.  3
> cycles is usually not enough.
> -AJ
> 
> Jack Tang wrote:
> 
> >Yes, Stefan.
> >But it missed some URLs. After I set the value to 3000, everything was OK.
> >
> >/Jack
> >
> >On 9/8/05, Stefan Groschupf <sg@media-style.com> wrote:
> >
> >
> >>Jack,
> >>That is the maximum number of outlinks per HTML page.
> >>All your example pages have fewer than 100 outlinks, right?
> >>Stefan
> >>
> >>On 07.09.2005 at 18:43, Jack Tang wrote:
> >>
> >>
> >>
> >>>Hi All
> >>>
> >>>Here is the "db.max.outlinks.per.page" property and its description in
> >>>nutch-default.xml
> >>>    <property>
> >>>      <name>db.max.outlinks.per.page</name>
> >>>      <value>100</value>
> >>>      <description>The maximum number of outlinks that we'll
> >>>      process for a page.
> >>>      </description>
> >>>    </property>
> >>>
> >>>I don't think the description is right.
> >>>Say, my crawler feeds are:
> >>>http://www.a.com/index.php (90 outlinks)
> >>>http://www.b.com/index.jsp  (80 outlinks)
> >>>http://www.c.com/index.html (50 outlinks)
> >>>
> >>>and the number of crawler threads is 30. Do you think the remaining URLs
> >>>((80 - 10) outlinks + 50 outlinks) will be fetched?
> >>>
> >>>I think the description should be "The maximum number of outlinks in
> >>>one fetching phase."
> >>>
> >>>
> >>>Regards
> >>>/Jack
> >>>--
> >>>Keep Discovering ... ...
> >>>http://www.jroller.com/page/jmars
> >>>
> >>>
> >>>
> >>>
> >>---------------------------------------------------------------
> >>company:        http://www.media-style.com
> >>forum:        http://www.text-mining.org
> >>blog:            http://www.find23.net
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
> >
> >
> 
> --
> AJ (Anjun) Chen, Ph.D.
> Canova Bioconsulting
> Marketing * BD * Software Development
> 748 Matadero Ave., Palo Alto, CA 94306, USA
> Cell 650-283-4091, anjun.chen@sbcglobal.net
> ---------------------------------------------------
> 
> 


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars
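
For readers following the thread, here is a minimal sketch of the per-page reading that Stefan describes, in which the limit is applied to each page separately rather than to a whole fetch phase. It is not Nutch code: the function names are hypothetical, and keeping the first N links is an assumption, since the thread does not say which outlinks are kept when a page exceeds the limit.

```python
def truncate_outlinks(outlinks, max_per_page=100):
    """Keep at most max_per_page outlinks from one page (assumed: the first N)."""
    return outlinks[:max_per_page]


def kept_per_page(pages, max_per_page=100):
    """Map each page URL to the number of outlinks kept under a per-page limit.

    `pages` maps a URL to the number of outlinks found on that page.
    """
    kept = {}
    for url, count in pages.items():
        # Placeholder outlinks; only the count matters for this illustration.
        outlinks = [f"{url}#link{i}" for i in range(count)]
        kept[url] = len(truncate_outlinks(outlinks, max_per_page))
    return kept


# Seed pages and outlink counts taken from Jack's example.
seeds = {
    "http://www.a.com/index.php": 90,
    "http://www.b.com/index.jsp": 80,
    "http://www.c.com/index.html": 50,
}

if __name__ == "__main__":
    for url, n in kept_per_page(seeds).items():
        print(url, n)
```

Under this reading none of the three example pages reaches the limit of 100, so nothing is cut from them. Only a page that by itself carries more than 100 outlinks would lose links, which would be consistent with Jack seeing the missing URLs appear after raising the value to 3000.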
