nutch-dev mailing list archives

From Uroš Gruber <>
Subject Re: [Fwd: Re: get CrawlDatum]
Date Thu, 07 Sep 2006 05:53:38 GMT
Andrzej Bialecki wrote:
> Uroš Gruber wrote:
>> I made a draft patch, but I still see some problems. I know the code 
>> needs to be cleaned up and tested. Right now I don't know what hop 
>> number to set for external URLs. For internal links it works great.
> (the patch changes CrawlDatum itself, I think it would be better to 
> put the hop counter in CrawlDatum.metaData.)
I can try to do it with metaData.
>> What is the whole idea of these changes?
>> Injected URLs always get hop 0. While fetching/updating/generating, 
>> the hop value is incremented by 1 (still no idea what to do with 
>> external links). Then I can add a config value such as max_hop to 
>> stop the fetcher and generator from creating more URLs.
>> This way it's possible to limit crawling vertically.
>> Comments are welcome.
> Well, it really depends on what you want to do when you encounter an 
> external link. Do you want to restart the counter, i.e. crawl the new 
> site at full depth up to max_hop? Then set hop=0. Do you want to 
> terminate the crawl at that link? then set hop=max_hop.
I talked with my friend about this, and here is what we came up with. 
Let's say manually injected URLs are good and checked by a human, and 
you probably want to start from them. So setting hop to 0 at injection 
is OK. While crawling we have some sort of filtering by host (regexp 
etc.). We need not worry about URLs that are not in our list, so their 
hop can be set to anything, maybe to max_hop.
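
To make the rule concrete, here is a rough sketch of how the hop value 
could be assigned to a link found on a fetched page. It is plain Python, 
not Nutch code. The names outlink_hop, same_host, external_policy and 
MAX_HOP are made up for illustration, the example.com/example.org URLs 
are placeholders, and comparing host names is only an assumed stand-in 
for whatever host filtering is really configured.

```python
from urllib.parse import urlsplit

MAX_HOP = 3  # hypothetical config value standing in for max_hop


def same_host(parent_url, child_url):
    """Assumption: a link is 'internal' when both URLs share a host name."""
    return urlsplit(parent_url).hostname == urlsplit(child_url).hostname


def outlink_hop(parent_url, parent_hop, child_url,
                external_policy="terminate", max_hop=MAX_HOP):
    """Return the hop value for child_url, a link found on parent_url."""
    if same_host(parent_url, child_url):
        # Internal link: one step deeper than the page it was found on.
        return parent_hop + 1
    if external_policy == "restart":
        # Crawl the new site at full depth, as if it had been injected.
        return 0
    if external_policy == "terminate":
        # Stop at this link: anything found on it would exceed max_hop.
        return max_hop
    raise ValueError("unknown external_policy: %r" % external_policy)


# Injected URLs start at hop 0; an internal link from one gets hop 1.
print(outlink_hop("http://example.com/", 0, "http://example.com/a"))
# An external link gets 0 or max_hop depending on the chosen policy.
print(outlink_hop("http://example.com/", 1, "http://example.org/", "restart"))
print(outlink_hop("http://example.com/", 1, "http://example.org/", "terminate"))
```

Which policy is right for external links is exactly the open question 
here; the sketch only shows that both options fit in the same place.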

But here is a scenario. We add and from injection. After 
crawling, we find on a site a link to
We can set that URL's hop to 0 or to max_hop, because we can update it 
after we have found the URL on a site.

I think the hop check needs to be done while updating, so we don't end 
up with a bunch of URLs having a hop greater than max_hop.
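
A rough sketch of what that check at update time could look like, again 
in plain Python rather than Nutch code. Here db is just a dict of 
url -> hop and update_hops is a made-up name. Keeping the smaller hop 
when a URL is found again is my assumption, not something decided yet.

```python
def update_hops(db, discovered, max_hop):
    """Merge discovered (url, hop) pairs into db, a dict of url -> hop.

    URLs whose hop is greater than max_hop are skipped, so they never
    reach the db. For a URL that is already known, the smaller hop is
    kept (an assumed merge rule).
    """
    for url, hop in discovered:
        if hop > max_hop:
            continue  # too deep: do not store it at all
        known = db.get(url)
        if known is None or hop < known:
            db[url] = hop
    return db


db = {"http://example.com/": 0}  # placeholder injected URL at hop 0
update_hops(db, [
    ("http://example.com/a", 1),     # new, within the limit: stored
    ("http://example.com/deep", 4),  # beyond max_hop: dropped
    ("http://example.com/", 2),      # already known at hop 0: unchanged
], max_hop=3)
print(db)
```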

I will try to make a decent patch for this for review, and if others 
have any ideas, please comment.


