nutch-dev mailing list archives

From Nicolás Lichtmaier
Subject Re: Creating a new scoring filter.
Date Tue, 27 Feb 2007 17:19:30 GMT

>> It doesn't seem like a good way to do it. What if there are no outlinks?
>> This method won't be called at all. And anyway, it would be called once
>> per outlink, which would multiply the work.
> Multiplication is easy to solve but you are right that it won't work
> if there are no outlinks.
> Maybe the scoring filter API should change? A distributeScoreToOutlinks
> method (which would be called even if there are no outlinks) may be
> more useful than the current one:
> CrawlDatum distributeScoreToOutlinks(Text fromUrl, List<String> toUrlList,
>     List<CrawlDatum> datumList, ParseData parseData, CrawlDatum adjust)
> This method gives more control to the plugin since knowing all the
> outlinks the plugin can make more informed decisions. Like, right now,
> there is no way a scoring filter can be sure that it has distributed
> all its cash (e.g if is 0.5 and
> is 1.0, filter will almost always distribute
> less than its cash).
> This will also work for your case, since you will just ignore the
> outlinks and return the adjust datum based on information in parse
> metadata.
> What do you (and others) think?

I think that good API design here means not assuming so much
about the plugin's behaviour. You are right about
distributeScoreToOutlinks(), but IMO it should be called something
like assignScores(). Then you could add an abstract class,
DistributingScorePlugin (implementing the interface), which implements
assignScores() and calls an "abstract protected" method called
distributeScoreToOutlink(). So the code for traversing the outlinks
would be in DistributingScorePlugin.
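
A minimal Java sketch of that arrangement, just to make the idea concrete.
None of this is existing Nutch code: the ScorePlugin interface, its
parameter list, and EvenSplitScorePlugin are hypothetical, and plain
String/float values stand in for Text, ParseData and CrawlDatum.

```java
import java.util.List;

/** Hypothetical interface: called once per parsed page, even with no outlinks. */
interface ScorePlugin {
    /**
     * Appends one score per outlink (in order) to outlinkScores and
     * returns the score for the page itself.
     */
    float assignScores(String fromUrl, String parsedText, float pageScore,
                       List<String> outlinks, List<Float> outlinkScores);
}

/** Owns the outlink traversal; subclasses only decide the per-outlink score. */
abstract class DistributingScorePlugin implements ScorePlugin {
    @Override
    public final float assignScores(String fromUrl, String parsedText, float pageScore,
                                    List<String> outlinks, List<Float> outlinkScores) {
        for (String toUrl : outlinks) {
            outlinkScores.add(
                distributeScoreToOutlink(fromUrl, toUrl, pageScore, outlinks.size()));
        }
        // Assumption of this sketch: the page's own score is left unchanged.
        return pageScore;
    }

    protected abstract float distributeScoreToOutlink(String fromUrl, String toUrl,
                                                      float pageScore, int outlinkCount);
}

/** Illustrative subclass: splits the page score evenly among the outlinks. */
class EvenSplitScorePlugin extends DistributingScorePlugin {
    @Override
    protected float distributeScoreToOutlink(String fromUrl, String toUrl,
                                             float pageScore, int outlinkCount) {
        // Only reached when there is at least one outlink, so outlinkCount > 0.
        return pageScore / outlinkCount;
    }
}
```

Because the subclass is told the outlink count, it can hand out the whole
score, which is the "distributed all its cash" concern from the quoted text.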

I would need another class, called ContentBasedScorePlugin. That class 
could call an abstract protected method called calculateScore() which 
would receive the parsed data and return the score.
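
Roughly, and again only as a sketch: the ScorePlugin interface below is
the same hypothetical stand-in for the proposed API (restated so the
block compiles on its own), a plain String stands in for the parsed
data, and KeywordScorePlugin is a made-up example scorer.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical interface: called once per parsed page, even with no outlinks. */
interface ScorePlugin {
    float assignScores(String fromUrl, String parsedText, float pageScore,
                       List<String> outlinks, List<Float> outlinkScores);
}

/** Ignores the outlinks and scores the page from its parsed content only. */
abstract class ContentBasedScorePlugin implements ScorePlugin {
    @Override
    public final float assignScores(String fromUrl, String parsedText, float pageScore,
                                    List<String> outlinks, List<Float> outlinkScores) {
        // outlinkScores is deliberately left untouched in this sketch.
        return calculateScore(parsedText);
    }

    protected abstract float calculateScore(String parsedText);
}

/** Illustrative subclass: the score is the number of times a keyword occurs. */
class KeywordScorePlugin extends ContentBasedScorePlugin {
    private final String keyword;

    KeywordScorePlugin(String keyword) {
        this.keyword = keyword.toLowerCase(Locale.ROOT);
    }

    @Override
    protected float calculateScore(String parsedText) {
        int hits = 0;
        // Case-insensitive match on whitespace-separated words.
        for (String word : parsedText.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.equals(keyword)) {
                hits++;
            }
        }
        return hits;
    }
}
```

Since assignScores() is called once per page rather than once per
outlink, this still works for a page with no outlinks at all.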

What do you think?
