nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Scoring API issues (LONG)
Date Thu, 18 Oct 2007 16:40:05 GMT
Sami Siren wrote:
> Andrzej Bialecki wrote:
>> Hi all,
>>
>> I've been working recently on a custom scoring plugin, and I found out
>> some issues with the scoring API that severely limit the way we can
>> calculate static page scores. I'd like to restart the discussion about
>> this API, and propose some changes. Any comments or suggestions are
>> welcome!
> 
> Hi,
> 
> In practice I have found out that sometimes it's just easier (and even
> more efficient) to write a custom mr job (yes, an additional phase into
> the process) to calculate the scores for urls.

Same here. For example, calculating PageRank requires running a separate
job. Other scoring techniques that use a post-processed link graph also
require running a separate MR job.
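
To illustrate why PageRank does not fit into the existing per-page steps, here is a minimal sketch of the textbook power iteration. It is not Nutch code: the function name `pagerank`, the `links` dictionary, and the default `damping` and `iterations` values are assumptions made for this example. The point is that every iteration needs a full pass over the whole link graph, which is what a separate batch job provides.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Textbook PageRank by power iteration (illustrative sketch only).

    links: dict mapping a page to the list of pages it links to.
    Returns a dict mapping every page to its score; scores sum to 1.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # One iteration is one full pass over the graph: this is the
        # part that has to run as its own analysis step.
        incoming = {page: 0.0 for page in pages}
        dangling = 0.0
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = scores[page] / len(targets)
                for target in targets:
                    incoming[target] += share
            else:
                # Pages without outlinks spread their score over all pages.
                dangling += scores[page]
        scores = {
            page: (1.0 - damping) / n
            + damping * (incoming[page] + dangling / n)
            for page in pages
        }
    return scores


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```

In a MapReduce setting each iteration would typically be one job: the map side emits score shares keyed by target page, and the reduce side sums them.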

> Using this strategy would give users more freedom in selecting the
> data (and algorithm) required and at the same time keep the other parts
> of the process slimmer.

Right ... except that the main (supposed) benefit of OPIC was that it
would be possible to avoid running an additional analysis step: the scores
were supposed to be recalculated online as part of the other steps. It
hasn't worked out this way, as we know, but this was the main motivation
for introducing the scoring API ... although it seems more and more that
this API is just a glorified OPIC, and that it's not sufficiently reusable
to benefit other scoring algorithms ...
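
For contrast with a batch computation, here is a rough sketch of the online idea behind OPIC as it is usually described: each page holds some "cash", and when the page is fetched that cash is recorded in its history and split equally among its outlinks. This is not the Nutch scoring plugin and not the scoring API; the class `OpicState`, the method `on_fetch`, and the handling of pages without outlinks are assumptions made for illustration.

```python
from collections import defaultdict


class OpicState:
    """Illustrative OPIC-style bookkeeping (hypothetical, not Nutch code)."""

    def __init__(self, seeds):
        self.cash = defaultdict(float)     # score waiting to be distributed
        self.history = defaultdict(float)  # score accumulated so far
        for url in seeds:
            self.cash[url] = 1.0 / len(seeds)

    def on_fetch(self, url, outlinks):
        """Update scores while processing one fetched page.

        No separate pass over the link graph is needed: the update only
        touches the fetched page and its outlinks.
        """
        amount = self.cash[url]
        self.history[url] += amount
        self.cash[url] = 0.0
        if outlinks:
            share = amount / len(outlinks)
            for target in outlinks:
                self.cash[target] += share
        # Simplification: cash of a page without outlinks is dropped here.

    def importance(self, url):
        """Estimate importance as the page's share of all recorded history."""
        total = sum(self.history.values())
        return self.history[url] / total if total else 0.0


if __name__ == "__main__":
    state = OpicState(["a"])
    state.on_fetch("a", ["b", "c"])
    state.on_fetch("b", ["c"])
    print(dict(state.cash), dict(state.history))
```

The update is local to one page and its outlinks, which is why it can ride along with the existing steps. An algorithm that needs a global view of the graph has no such hook.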


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

