nutch-dev mailing list archives

From "Doğacan Güney" <>
Subject Re: Re: Creating a new scoring filter.
Date Mon, 26 Feb 2007 07:20:03 GMT

On 2/24/07, Nicolás Lichtmaier <> wrote:
> >> Hi, I'm working with a fixed set of URLs and I'd like to replace the
> >> standard OPIC score plugin with something different. I'd like to
> >> create a scoring plugin which entirely bases its score on the
> >> document parsed data (yes, I will trust the document text itself to
> >> decide its relevance).
> >>
> >> I've been reading the code and the ScoringFilter interface seems to
> >> be targeted for use by OPIC-like algorithms. For example, the step
> >> called after parsing is called "passScoreAfterParsing()", telling me
> >> what I am supposed to do in that method, and the method setting the
> >> scores is called "distributeScoreToOutlink()". All of this scares
> >> me... would it be safe to use these methods differently and, e.g.,
> >> modify the document score in "passScoreAfterParsing()" instead of
> >> just "passing it"?
> >
> > You can modify whichever way you want - it's up to you. These methods
> > simply ensure that the score data (not just the CrawlDatum.getScore(),
> > but possibly a multitude of metadata collected on the way) is passed
> > to appropriate segment parts.
> >
> > E.g. in distributeScoreToOutlink() you could simply set the default
> > score for new pages to a fixed value, without actually using the score
> > information from the source page.
> >
> Yeah, but there I don't have the parse data for those new pages. What I
> would like to do is override "passScoreAfterParsing()" and not pass
> anything: just analyze the parsed data and decide a score. The problem
> is that that function doesn't get passed the CrawlDatum... it seems I'll
> need to modify Nutch itself.... =(

Can you be a bit more specific about your problem?

Anyway, without the details, here is my guess on how you can do it:
1) In passScoreAfterParsing(), analyze the content and parse text, and
put the relevant score information in the parse data's metadata.
2) In distributeScoreToOutlink(), ignore the outlinks (just give them a
fixed default score), but check your parse data and return an adjust
datum with the status STATUS_LINKED and the score extracted from the
parse data. This adjust datum will update the score of the original
datum in updatedb.
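
The two steps above could be sketched roughly like this. It is only a
self-contained illustration, not the real Nutch API: Datum, the plain
Map used as metadata, the key "sketch.content.score", the keyword-ratio
scoring and the default outlink score of 1.0 are all hypothetical
stand-ins for CrawlDatum, the parse data's metadata and whatever
relevance measure you actually want.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins only; the real types live in org.apache.nutch.*.
class ContentScoreSketch {
    static final String SCORE_KEY = "sketch.content.score"; // assumed metadata key
    static final float DEFAULT_OUTLINK_SCORE = 1.0f;          // assumed fixed value

    /** Minimal stand-in for a CrawlDatum: just a status and a score. */
    static final class Datum {
        String status;
        float score;
        Datum(String status, float score) { this.status = status; this.score = score; }
    }

    /** Toy relevance measure: fraction of words that equal the keyword. */
    static float scoreFromText(String text, String keyword) {
        int total = 0, hits = 0;
        for (String token : text.split("\\W+")) {
            if (token.isEmpty()) continue;
            total++;
            if (token.equalsIgnoreCase(keyword)) hits++;
        }
        return total == 0 ? 0f : (float) hits / total;
    }

    /** Step 1: compute a score from the parse text and stash it in the metadata. */
    static void passScoreAfterParsing(String parseText, Map<String, String> parseMeta) {
        parseMeta.put(SCORE_KEY, Float.toString(scoreFromText(parseText, "nutch")));
    }

    /** Step 2: give the outlink a fixed score, return an adjust datum for the source page. */
    static Datum distributeScoreToOutlink(Map<String, String> parseMeta, Datum target, Datum adjust) {
        target.score = DEFAULT_OUTLINK_SCORE;       // source page score is not used
        String stored = parseMeta.get(SCORE_KEY);
        if (stored == null) return adjust;          // nothing was recorded in step 1
        if (adjust == null) adjust = new Datum("STATUS_LINKED", 0f);
        adjust.status = "STATUS_LINKED";
        adjust.score = Float.parseFloat(stored);    // score extracted from parse data
        return adjust;
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        passScoreAfterParsing("nutch nutch other other", meta);
        Datum outlink = new Datum("STATUS_LINKED", 0f);
        Datum adjust = distributeScoreToOutlink(meta, outlink, null);
        System.out.println(outlink.score + " " + adjust.score); // prints "1.0 0.5"
    }
}
```

The metadata map is the carrier here because, as noted in the quoted
message, passScoreAfterParsing() is not handed the CrawlDatum itself.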

Does this work for you?

> Thanks!

Doğacan Güney