nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: [Fwd: Re: get CrawlDatum]
Date Wed, 06 Sep 2006 19:31:27 GMT
Uroš Gruber wrote:
> I made a draft patch, but I still see some problems. I know the code
> needs to be cleaned up and tested. Right now I don't know what number
> to set for external URLs. For internal links it works great.

(The patch changes CrawlDatum itself; I think it would be better to put 
the hop counter in CrawlDatum.metaData.)
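
A minimal sketch of keeping the counter in metadata rather than in a new field. A plain Map stands in for CrawlDatum.metaData so the example runs without Nutch on the classpath. The key name "hop" and the helper names are hypothetical, not part of the patch or of Nutch.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the CrawlDatum metadata; using a plain Map is an assumption
// made so the sketch is self-contained.
final class HopMetadata {
    // Hypothetical metadata key; not a name defined by Nutch.
    static final String HOP_KEY = "hop";

    // Returns the stored hop count. A missing entry is treated as hop 0,
    // the value proposed for injected URLs.
    static int getHop(Map<String, String> metaData) {
        String value = metaData.get(HOP_KEY);
        return value == null ? 0 : Integer.parseInt(value);
    }

    // Stores the hop count as a string under HOP_KEY.
    static void setHop(Map<String, String> metaData, int hop) {
        metaData.put(HOP_KEY, Integer.toString(hop));
    }

    public static void main(String[] args) {
        Map<String, String> injected = new HashMap<>();
        setHop(injected, 0);

        // An outlink found on the injected page is one hop further away.
        Map<String, String> outlink = new HashMap<>();
        setHop(outlink, getHop(injected) + 1);

        System.out.println("outlink hop = " + getHop(outlink));
    }
}
```

Keeping the value in metadata would leave the serialized form of CrawlDatum itself unchanged, which is the motivation for the suggestion above.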

>
> The whole idea of these changes is as follows.
>
> Injected URLs always get hop 0. While fetching/updating/generating, the
> hop value is incremented by 1 (I still have no idea what to do with
> external links). Then I can add a config value such as max_hop to stop
> the fetcher and generator from creating more URLs.
>
> This way it's possible to limit crawling vertically.
>
> Comments are welcome.

Well, it really depends on what you want to do when you encounter an 
external link. Do you want to restart the counter, i.e. crawl the new 
site at full depth up to max_hop? Then set hop=0. Do you want to 
terminate the crawl at that link? Then set hop=max_hop.
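
The two choices can be sketched as a small policy function. Everything here is an illustration under stated assumptions: the class, enum, and method names are hypothetical, "external" is taken to mean "different host", and a URL is assumed to stay eligible while hop <= max_hop. Under that last assumption the external page itself is still fetched in the TERMINATE case, but none of its outlinks are.

```java
import java.net.URI;

// Hypothetical helper; none of these names come from Nutch or from the patch.
final class HopPolicy {
    // RESTART: crawl the new site at full depth. TERMINATE: stop at the link.
    enum ExternalLinkMode { RESTART, TERMINATE }

    // Assumption: a link is external when the two hosts differ.
    // URLs without a parsable host are conservatively treated as external.
    static boolean isExternal(String fromUrl, String toUrl) {
        String fromHost = URI.create(fromUrl).getHost();
        String toHost = URI.create(toUrl).getHost();
        if (fromHost == null || toHost == null) {
            return true;
        }
        return !fromHost.equalsIgnoreCase(toHost);
    }

    // Hop value assigned to an outlink found on a page with hop parentHop.
    static int childHop(int parentHop, boolean external,
                        ExternalLinkMode mode, int maxHop) {
        if (!external) {
            return parentHop + 1;            // internal link: one hop deeper
        }
        return mode == ExternalLinkMode.RESTART
                ? 0                          // restart the counter
                : maxHop;                    // no further hops allowed
    }

    // Assumption: URLs with hop <= maxHop are still eligible.
    static boolean withinLimit(int hop, int maxHop) {
        return hop <= maxHop;
    }

    public static void main(String[] args) {
        int maxHop = 3;
        boolean ext = isExternal("http://a.example/page", "http://b.example/");
        int hop = childHop(2, ext, ExternalLinkMode.TERMINATE, maxHop);
        System.out.println("external=" + ext + " hop=" + hop
                + " eligible=" + withinLimit(hop, maxHop)
                + " outlinksEligible=" + withinLimit(hop + 1, maxHop));
    }
}
```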

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com