nutch-dev mailing list archives

From "Daniel Drazner" <dan...@domainspa.com>
Subject RE: [Nutch-dev] Nutch Crawler !!!
Date Tue, 08 Mar 2005 03:13:31 GMT
Thanks a lot.

I have also started running Nutch in debug mode. It's an interesting
experience, but any technical documentation would definitely save me some time.

Will wait to see what others have to add here.

Thanks,
Daniel

-----Original Message-----
From: nutch-developers-admin@lists.sourceforge.net
[mailto:nutch-developers-admin@lists.sourceforge.net]On Behalf Of Feng Zhou
Sent: Monday, March 07, 2005 9:40 PM
To: dev@nutch.org
Cc: nutch-developers@lists.sourceforge.net
Subject: Re: [Nutch-dev] Nutch Crawler !!!

I've been reading the Nutch code recently. Below is my current
understanding. Others should correct me if I'm wrong.

Regards,
- Feng Zhou

On Mon, 7 Mar 2005 21:15:38 -0500, Daniel Drazner <daniel@domainspa.com>
wrote:
> Hi,
>
> Thanks for your email. I have come across this article before.
> Unfortunately it doesn't reveal all secrets.
>
> Thanks,
> Daniel
>
> >
> > 1. Perform DNS lookups. DNS results caching in memory and DB.

I don't think Nutch caches DNS lookup results itself. The Java class
library has a built-in cache though, so it normally won't look up the
same host name again immediately after it has been resolved.
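
To illustrate the JVM-level cache mentioned above, here is a minimal sketch,
assuming a modern JDK. The class DnsCacheSketch and its methods are
hypothetical names of mine and are not part of Nutch. The only real knob used
is the standard "networkaddress.cache.ttl" security property, which controls
how long successful lookups stay cached.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

public class DnsCacheSketch {
    // Sets the JVM-wide TTL (in seconds) for successful lookups.
    // Best done at startup, before the first lookup is made.
    public static void configureCache(int ttlSeconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(ttlSeconds));
    }

    // Resolves a host name; repeated calls within the TTL are normally
    // answered from the JVM cache rather than by a new DNS query.
    public static InetAddress resolve(String host) {
        try {
            return InetAddress.getByName(host);
        } catch (UnknownHostException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        configureCache(300);
        String host = args.length > 0 ? args[0] : "localhost";
        InetAddress first = resolve(host);
        InetAddress second = resolve(host); // second call hits the cache
        System.out.println(first.getHostAddress() + " " + second.getHostAddress());
    }
}
```

Note that the cache is keyed by host name, not by URL, so many URLs on one
host share a single cached lookup.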

> > 2. How crawler is dealing with URL duplicates in the same crawl.

Crawling is done in rounds of "fetches". Each round operates on a
predetermined "fetchlist" generated by the "nutch generate" tool. That
is, newly discovered URLs are not crawled immediately. They get
added to the WebDB after the current round of fetching finishes. The
separate tool for this is "nutch updatedb". The WebDB is indexed by
URL, and each URL has at most one entry in it. So you can say duplicates
are eliminated when the DB update is done.
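
The round structure can be sketched roughly as follows. This is only an
illustration of the idea, not Nutch's actual code: the class
CrawlRoundsSketch, its Status enum and its method names are all hypothetical,
and a sorted in-memory map stands in for the on-disk WebDB.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CrawlRoundsSketch {
    public enum Status { UNFETCHED, FETCHED }

    // Stand-in for the WebDB: exactly one entry per URL, keyed by URL.
    private final Map<String, Status> webDb = new TreeMap<>();

    // Seeds the DB; a URL that is already present is left untouched.
    public void inject(String url) {
        webDb.putIfAbsent(url, Status.UNFETCHED);
    }

    // Rough analogue of "nutch generate": a fixed list for one round.
    public List<String> generate() {
        List<String> fetchList = new ArrayList<>();
        for (Map.Entry<String, Status> e : webDb.entrySet()) {
            if (e.getValue() == Status.UNFETCHED) {
                fetchList.add(e.getKey());
            }
        }
        return fetchList;
    }

    // Rough analogue of "nutch updatedb": runs after the round finishes.
    // Merging by key is what removes duplicate discoveries.
    public void updateDb(List<String> fetched, List<String> discovered) {
        for (String url : fetched) {
            webDb.put(url, Status.FETCHED);
        }
        for (String url : discovered) {
            webDb.putIfAbsent(url, Status.UNFETCHED);
        }
    }

    public int size() {
        return webDb.size();
    }
}
```

A link found several times in one round, or one pointing back at an already
fetched page, collapses into the existing entry when updateDb runs, so the
next generate call lists each new URL once.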

> > 3. Robots.txt parsing and caching.

I haven't looked at the code for this yet.

> > 4. What crawler is doing when new link encountered. Parsing? Adding to
> > his queue? Adding to all threads queue?

See question 2 above.

> > 5. What memory structure is used for memory caching, page content
> > caching, DB.

The crawler presumably does not need to do much caching. It essentially
goes over the fetchlist (a linear list), gets each page, parses it and
appends the results to the output files. No global data structure is
needed.

> > 6. What Page rank and Link analysis techniques are used.

I believe it's the vanilla pagerank algorithm.
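
For reference, the textbook form of PageRank is a simple power iteration.
The sketch below is my own illustration of that textbook algorithm and is not
taken from the Nutch source; the class name PageRankSketch, the adjacency
array layout and the damping value used in the usage note are all
assumptions.

```java
import java.util.Arrays;

public class PageRankSketch {
    // links[i] lists the indices of the pages that page i links to.
    public static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // rank held by pages with no outlinks

            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    dangling += rank[i];
                    continue;
                }
                // Each page splits its rank evenly among its outlinks.
                double share = rank[i] / links[i].length;
                for (int target : links[i]) {
                    next[target] += share;
                }
            }

            // Apply damping and spread dangling rank over all pages.
            for (int i = 0; i < n; i++) {
                next[i] = (1.0 - damping) / n
                        + damping * (next[i] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }
}
```

For example, with three pages where 0 and 1 link to each other and 2 links
only to 0, calling pageRank(links, 0.85, 50) ranks page 0 highest and page 2
(which has no inlinks) lowest.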

> >
> > Will greatly appreciate if somebody will point me the proper
> > documentation.
> >
> > Thanks,
> > Daniel
> >
> > -------------------------------------------------------
> > SF email is sponsored by - The IT Product Guide
> > Read honest & candid reviews on hundreds of IT Products from real users.
> > Discover which products truly live up to the hype. Start reading now.
> > http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> > _______________________________________________
> > Nutch-developers mailing list
> > Nutch-developers@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/nutch-developers
> >
>
> --
> - Feng
>
>



