nutch-dev mailing list archives

From Nicolás Lichtmaier <n...@reloco.com.ar>
Subject Re: Creating a new scoring filter.
Date Tue, 27 Feb 2007 15:41:41 GMT

>> Yeah, but there I don't have the parse data for those new pages. What I
>> would like to do is override "passScoreAfterParsing()" and not pass
>> anything: just analyze the parsed data and decide a score. The problem
>> is that that function doesn't get passed the CrawlDatum... it seems I'll
>> need to modify Nutch itself.... =(
> Can you be a bit more specific about your problem?

I'm indexing a fixed set of URLs that I believe are all one specific
type of document. I don't care about links: I'm using -noAdditions to
prevent links from being added to the crawldb. (I've backported that to
0.8.x and it's waiting for somebody to commit it =)
https://issues.apache.org/jira/browse/NUTCH-438 )

I just want to replace the scoring algorithm with one that tests whether
each URL really is that specific type of document. I want to use the
parse data of a document to calculate its relevance.

> Anyway, without the details, here is my guess on how you can do it:
> 1) In passScoreAfterParsing(), analyze the content and parse text and
> put the relevant score information in parse data's metadata.
> 2) In distributeScoreToOutlink() ignore the outlinks (just give them
> initialScore()),
> but check your parse data and return an adjust datum with the status
> STATUS_LINKED and score extracted from parse data. This adjust datum
> will update the score of the original datum in updatedb.
>
> Does this work for you?

That doesn't seem like a good way to do it. What if there are no
outlinks? Then this method won't be called at all. And in any case it
would be called once per outlink, which would multiply the work.
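
To make the objection concrete, here is a minimal plain-Java sketch of
the suggested two-step flow. It is not Nutch code: the class, the
metadata key, the keyword-based relevance test and the simulated
per-outlink loop are all hypothetical stand-ins, and the loop simply
assumes the hook runs once per outlink, as described above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Plain-Java stand-in; none of these types or names are Nutch classes. */
public class ParseScoreSketch {

    /** Hypothetical metadata key under which the relevance score is stored. */
    static final String SCORE_KEY = "x-relevance-score";

    /** Counts calls to the per-outlink hook, to make the repeated work visible. */
    static int outlinkHookCalls = 0;

    /** Step 1: analyze the parse text once and record a score in the metadata. */
    static void scoreAfterParsing(String parseText, Map<String, String> parseMeta,
                                  List<String> keywords) {
        String lower = parseText.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String keyword : keywords) {
            if (lower.contains(keyword.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        // Assumed relevance measure: fraction of keywords found in the text.
        float score = keywords.isEmpty() ? 0f : (float) hits / keywords.size();
        parseMeta.put(SCORE_KEY, Float.toString(score));
    }

    /** Step 2: the per-outlink hook ignores the outlink and reads the stored score. */
    static float scoreFromOutlinkHook(Map<String, String> parseMeta) {
        outlinkHookCalls++;
        return Float.parseFloat(parseMeta.getOrDefault(SCORE_KEY, "0"));
    }

    /**
     * Simulates one page: score it at parse time, then invoke the hook once
     * per outlink. Returns the adjusted score, or null if the hook never ran.
     */
    static Float processPage(String parseText, List<String> outlinks,
                             List<String> keywords) {
        Map<String, String> parseMeta = new HashMap<>();
        scoreAfterParsing(parseText, parseMeta, keywords);
        Float adjusted = null;
        for (String ignoredOutlink : outlinks) {
            adjusted = scoreFromOutlinkHook(parseMeta);
        }
        return adjusted;
    }

    public static void main(String[] args) {
        List<String> keywords = List.of("invoice", "refund");
        System.out.println(processPage("Invoice number 7",
                List.of("u1", "u2", "u3"), keywords));
        System.out.println("hook calls: " + outlinkHookCalls);
        System.out.println(processPage("Invoice number 7", List.of(), keywords));
    }
}
```

With three outlinks the hook runs three times and returns the same
stored score each time; with no outlinks it never runs, so the page's
score is never adjusted.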

Thanks!

