nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Closed: (NUTCH-321) Scoring API deficiency
Date Wed, 19 Jul 2006 22:42:14 GMT

Andrzej Bialecki  closed NUTCH-321.

    Resolution: Fixed

Patch applied to trunk/.

NOTE: this requires a (trivial) change in any custom scoring plugin. Most likely, to accommodate
future support for interleaved fetching cycles, you should use the "old" CrawlDatum
as the basis for the initial score to be updated, instead of the "datum" (which is a snapshot
of the value at the time the fetchlist was generated).
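
For plugin authors, the change described in the note might look roughly like the
sketch below. It is only an illustration under stated assumptions: Datum and
ScoreUpdateSketch are hypothetical stand-ins that model nothing but a score, and
the parameter list is simplified. The actual ScoringFilter.updateDbScore()
signature is the one in the attached patch, not the one shown here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for CrawlDatum; only the score is modeled.
class Datum {
    private float score;
    Datum(float score) { this.score = score; }
    float getScore() { return score; }
    void setScore(float score) { this.score = score; }
}

class ScoreUpdateSketch {
    /**
     * Simplified, assumed shape of a score update.
     *
     * @param old      entry currently stored in CrawlDB, or null if the URL is new
     * @param datum    entry from the current segment (a snapshot taken when the
     *                 fetchlist was generated); it receives the updated score
     * @param inlinked score contributions from pages linking to this URL
     */
    static void updateDbScore(Datum old, Datum datum, List<Datum> inlinked) {
        float adjust = 0.0f;
        for (Datum linked : inlinked) {
            adjust += linked.getScore();
        }
        // Start from the CrawlDB value when it exists; fall back to the
        // segment snapshot only when there is no CrawlDB entry yet.
        Datum base = (old != null) ? old : datum;
        datum.setScore(base.getScore() + adjust);
    }

    public static void main(String[] args) {
        Datum old = new Datum(2.0f);   // CrawlDB value, changed by another updatedb
        Datum datum = new Datum(1.0f); // stale snapshot from the fetchlist
        updateDbScore(old, datum, Arrays.asList(new Datum(0.5f)));
        System.out.println(datum.getScore()); // prints 2.5
    }
}
```

Starting from the snapshot instead would give 1.5 in this example, silently
discarding the change made by the intervening updatedb run.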

> Scoring API deficiency
> ----------------------
>                 Key: NUTCH-321
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.8-dev
>            Reporter: Andrzej Bialecki 
>             Fix For: 0.8-dev
>         Attachments: patch.txt
>
> Currently the method ScoringFilter.updateDbScore() doesn't use the "old" value from the existing
> CrawlDB. Instead it uses the value taken from the fetchlist in the current segment, which
> represents a snapshot of the "old" value taken at the moment the fetchlist was generated.
>
> The problem with this approach is that if/when we add the ability to interleave generate/fetch/update
> cycles, the initial score value in the CrawlDatum instance that comes from the current segment
> could already be outdated if another updatedb was run in the meantime and changed the
> DB score.
>
> For this reason we should always assume that the value from CrawlDB, if it exists, represents
> the most recent version of the CrawlDatum before the update, and use this instance as the base.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators:
For more information on JIRA, see:

