nutch-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Ignore external links from crawled domains
Date Mon, 08 Aug 2005 14:35:20 GMT
>A very basic facility seems to be missing in Nutch. If I have a list 
>of 2000 URLs in the Nutch DB and want to ignore external links, I have 
>to build a regex-filter with thousands of different domains I want to 
>crawl. There is no parameter to crawl only those domains and ignore 
>external links.
>
>For now, is there another solution? Has anybody worked on that?

We did something similar, though not exactly the same.

We've got a list of "favored domains", and we use this to boost link 
scores in the FetchListTool before sorting and selecting the topN. So 
you could easily apply the same approach to strip out any URLs that 
aren't in your domain set.
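
To make that concrete, here is a rough sketch in Python (not Nutch code, 
and not our actual FetchListTool changes) of dropping candidates whose 
host isn't in the domain set before the sort and topN cut. The names 
host_in_domains, select_top_n, and the example URLs are all made up for 
illustration.

```python
from urllib.parse import urlsplit


def host_in_domains(url, domains):
    """Return True if the URL's host is one of `domains` or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    # Match "example.org" and "www.example.org", but not "notexample.org".
    return any(host == d or host.endswith("." + d) for d in domains)


def select_top_n(candidates, domains, top_n):
    """Drop off-domain (url, score) pairs, then sort by score and keep top_n."""
    kept = [(url, score) for url, score in candidates
            if host_in_domains(url, domains)]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_n]


# Hypothetical domain set and candidate list.
favored = {"example.org", "example.net"}
candidates = [
    ("http://www.example.org/a", 1.0),
    ("http://other.example.com/b", 5.0),
    ("http://example.net/c", 2.0),
    ("http://notexample.org/d", 9.0),
]
selected = select_top_n(candidates, favored, top_n=10)
```

The suffix check is what avoids needing one regex per domain: a single 
set lookup style test covers each host and its subdomains.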

Another approach that I haven't tried would be to set the external 
link weight (db.score.link.external) to 0. Any new page added by a 
link that's "leaving" a domain would then effectively get a score of 0. 
Two problems I can think of are (a) a link between pages from two of 
your target domains would presumably also count as external, so the 
linked page could end up with a score of 0 as well, and (b) without 
mods to FetchListTool you still might wind up fetching a page with a 
score of 0.
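
Problem (b) can be illustrated with a small Python sketch. This is a toy 
model of "sort by score, take the topN", not the real FetchListTool, and 
select_fetch_list, drop_zero, and the URLs are hypothetical. It assumes 
that selection only ranks pages and never excludes them by score.

```python
def select_fetch_list(candidates, top_n, drop_zero=False):
    """Rank (url, score) pairs by score and return the first top_n.

    With drop_zero=False, a page scored 0 is only ranked last, so it is
    still selected whenever top_n is larger than the number of
    positively scored pages.
    """
    if drop_zero:
        candidates = [c for c in candidates if c[1] > 0]
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ranked[:top_n]


# Hypothetical pages; the second was reached through an external link.
pages = [
    ("http://a.example/1", 1.0),
    ("http://b.example/x", 0.0),
    ("http://a.example/2", 0.5),
]
unfiltered = select_fetch_list(pages, top_n=3)
filtered = select_fetch_list(pages, top_n=3, drop_zero=True)
```

Here unfiltered still contains the zero-scored page, while filtered does 
not, which is the kind of explicit cutoff the selection step would need.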

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
