nutch-dev mailing list archives

From Uroš Gruber <>
Subject [Fwd: Re: get CrawlDatum]
Date Wed, 06 Sep 2006 17:43:05 GMT
A while ago I posted this on the dev list but got no reply. I wonder 
whether this is the right approach and whether I should keep working on 
this feature. Do you think this idea would help Nutch, or is it a dead 
end that you've already discussed?



Andrzej Bialecki wrote:
> Uroš Gruber wrote:
>> ParseData.metadata sounds nice, but I think I'm lost again :)
>> If I understand code flow the best place would be in Fetcher [262]
>> but i'm not sure that datum holds info of url being fetched
> On the input to the fetcher you get a URL and a CrawlDatum (originally 
> coming from the crawldb). Check for example how the segment name is 
> passed around in metadata, you can use the same method.

I made a draft patch, but I still see some problems. I know the code 
needs to be cleaned up and tested. Right now I don't know what hop 
number to assign to external URLs. For internal links it works well.

Here is the idea behind these changes.

Injected URLs always get hop 0. During fetching/updating/generating, the 
hop value is incremented by 1 (I still have no idea what to do with 
external links). Then I can add a config value such as max_hop to stop 
the fetcher and generator from creating more URLs past that limit.

This way it's possible to limit crawling vertically, i.e. by depth.
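
To make the hop idea concrete, here is a rough sketch in plain Java. It 
is not the patch and does not use any Nutch classes: HopSketch, Entry, 
expand and INJECTED_HOP are made-up names for illustration only, and 
maxHop stands in for the proposed max_hop config value. It also treats 
every outlink the same way, so it does not answer the open question 
about external links.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Nutch.
final class HopSketch {
    // Injected (seed) URLs always start at hop 0.
    static final int INJECTED_HOP = 0;

    /** A URL together with the hop count at which it was discovered. */
    static final class Entry {
        final String url;
        final int hop;

        Entry(String url, int hop) {
            this.url = url;
            this.hop = hop;
        }
    }

    /**
     * Outlinks found on a page get the parent's hop + 1.
     * If that would exceed maxHop, no new URLs are produced.
     */
    static List<Entry> expand(Entry parent, List<String> outlinks, int maxHop) {
        List<Entry> next = new ArrayList<>();
        int childHop = parent.hop + 1;
        if (childHop > maxHop) {
            return next; // depth limit reached, generate nothing further
        }
        for (String link : outlinks) {
            next.add(new Entry(link, childHop));
        }
        return next;
    }
}
```

With maxHop set to 1, outlinks of an injected URL are kept (hop 1), but 
outlinks of those pages are dropped because they would be at hop 2.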

Comments are welcome.
