nutch-dev mailing list archives

From Jack Tang <him...@gmail.com>
Subject Re: Nutch crawler is breadth-first ?
Date Wed, 07 Sep 2005 09:31:04 GMT
Hi

I found the reason. Nutch processes at most 100 outlinks per page,
and the page on this website contains more than 300 URLs.
Now everything is OK.
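
For anyone who hits the same thing, here is a minimal sketch of the effect, not Nutch's actual code. It assumes the limit simply keeps the first N links in page order. The class and method names are hypothetical, and 300 links stands in for my test page. As far as I can tell the setting is db.max.outlinks.per.page in nutch-default.xml, but please check the name against your own version.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkCapSketch {

    // Hypothetical helper: keep at most maxOutlinks links, in page order.
    // Assumption: the limit truncates the list rather than sampling it.
    static List<String> capOutlinks(List<String> outlinks, int maxOutlinks) {
        int kept = Math.min(outlinks.size(), maxOutlinks);
        return new ArrayList<>(outlinks.subList(0, kept));
    }

    // Builds an illustrative link list: http://www.a.com/x1.html ... xN.html
    static List<String> buildLinks(int count) {
        List<String> links = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            links.add("http://www.a.com/x" + i + ".html");
        }
        return links;
    }

    public static void main(String[] args) {
        List<String> onPage = buildLinks(300);            // links found on the page
        List<String> followed = capOutlinks(onPage, 100); // limit of 100 outlinks

        System.out.println("links on page: " + onPage.size());
        System.out.println("links kept:    " + followed.size());
        System.out.println("links dropped: " + (onPage.size() - followed.size()));
    }
}
```

With a limit of 100, two thirds of the links on such a page never reach the next crawl round, which looks like lost URLs even though the crawl itself is breadth-first.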

/Jack

On 9/7/05, Jack Tang <himars@gmail.com> wrote:
> Hi Andrzej
> 
> First of all, thanks for your quick response.
> 
> On 9/7/05, Andrzej Bialecki <ab@getopt.org> wrote:
> > Jack Tang wrote:
> > > Hi All
> > >
> > > Is the Nutch crawler breadth-first? It seems a lot of URLs are lost
> > > when I try breadth-first crawling with the depth set to 3.
> > > Any comments?
> >
> > Yes, and yes - there is a possibility that some urls are lost, if they
> > require maintaining a single session. If you encounter such sites, a
> > depth-first crawler would be better.
> 
> The website does not require maintaining a single session.
> My experiment is set up like this:
> 
> X.html contains a list of URLs, say
> http://www.a.com/x1.html
> http://www.a.com/x2.html
> http://www.a.com/x3.html
> http://www.a.com/x4.html
> http://www.a.com/x5.html
> http://www.a.com/x6.html
> http://www.a.com/x7.html
> ....
> http://www.a.com/x30.html
> 
> I set the crawl depth to 3 and use X.html as the URL feed.
> I use urlfilter-prefix as the URL filter (prefix=http://www.a.com).
> In my parser I count the URLs, and the count is 10.
> 
> However, if I put all 30 URLs into the URL feed file, the count in the
> parser is right.
> Odd?
> 
> Regards
> /Jack
> > It's not too difficult to build one, using the tools already present in
> > Nutch. Contributions are welcome... ;-)
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >   ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
> >
> 
> 
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
> 


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars
