nutch-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Optimizing which links to fetch
Date Mon, 20 Jun 2005 13:48:27 GMT
Hi all,

It seems that the default behavior of Nutch when sorting links to
fetch is to use scoreByLinkCount. Every link on a page then gets a
fetch score equal to the containing page's "in-bound link" score
(or, more precisely, the log of that score).

What I'd like to do is rate each link on a page separately, based on 
its proximity to key words and other calculated hot-spots. Has this 
been done before? Is the support already there, and I haven't found 
it yet?
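
To make the idea concrete, here is a minimal sketch of per-link scoring by keyword proximity. It is not Nutch code: the class LinkProximityScorer, its score() method, and the 1 / (1 + distance) formula are all assumptions made for illustration, and it presumes the page has already been tokenized and each anchor's token offset is known.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Nutch. */
class LinkProximityScorer {

    /**
     * Scores each link by its distance (in tokens) to the nearest keyword.
     *
     * @param tokens        page text split into words, in document order
     * @param linkPositions token index at which each anchor starts
     * @param keywords      lower-case key words to look for
     * @return one score per link: 1 / (1 + distance to the nearest keyword),
     *         or 0 when the page contains no keyword at all
     */
    static double[] score(List<String> tokens, int[] linkPositions, Set<String> keywords) {
        // Collect the token positions of every keyword occurrence.
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (keywords.contains(tokens.get(i).toLowerCase(Locale.ROOT))) {
                hits.add(i);
            }
        }

        double[] scores = new double[linkPositions.length];
        for (int k = 0; k < linkPositions.length; k++) {
            if (hits.isEmpty()) {
                scores[k] = 0.0;
                continue;
            }
            // Find the closest keyword occurrence to this link.
            int best = Integer.MAX_VALUE;
            for (int h : hits) {
                best = Math.min(best, Math.abs(h - linkPositions[k]));
            }
            scores[k] = 1.0 / (1.0 + best);
        }
        return scores;
    }
}
```

With this assumed formula, a link right next to a key word scores close to 1 and the score falls off as the link moves away, so links on the same page no longer all share one value.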

If I need to do it myself, the most straightforward approach would be 
to modify emitFetchList() to parse each page (from webdb.pages()), 
matching up the anchors with what's returned by 
dbAnchors.getanchors(). But this seems inefficient and awkward. Would 
it be better to do this analysis when parsing the HTML originally, 
and somehow save each anchor's score in the web DB?
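
As a rough sketch of that second option, the scores could be recorded once at parse time and simply read back when the fetch list is built. The AnchorScoreStore class below is hypothetical: it is an in-memory stand-in for whatever the web DB would actually persist, and the choice to keep the highest score when several pages link to the same URL is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for per-link scores saved in the web DB. */
class AnchorScoreStore {
    private final Map<String, Double> scoreByUrl = new HashMap<>();

    /** Called once per outlink while a page is parsed; keeps the best score seen so far. */
    void record(String targetUrl, double score) {
        scoreByUrl.merge(targetUrl, score, Math::max);
    }

    /** Called when the fetch list is generated: returns up to n URLs, highest score first. */
    List<String> topUrls(int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scoreByUrl.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        List<String> urls = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            urls.add(entries.get(i).getKey());
        }
        return urls;
    }
}
```

The point of this arrangement is that fetch list generation never has to re-parse a page; it only sorts scores that were computed while the HTML was already in hand.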

Thanks,

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
