nutch-dev mailing list archives

From Dogacan Güney (JIRA) <j...@apache.org>
Subject [jira] Updated: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs
Date Tue, 26 Dec 2006 11:32:24 GMT
     [ http://issues.apache.org/jira/browse/NUTCH-420?page=all ]

Dogacan Güney updated NUTCH-420:
--------------------------------

    Attachment: dedup.patch

Patch for the problem. This patch also slightly refactors the code.

> DeleteDuplicates.HashPartitioner depends on the order of IndexDocs
> ------------------------------------------------------------------
>
>                 Key: NUTCH-420
>                 URL: http://issues.apache.org/jira/browse/NUTCH-420
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Dogacan Güney
>            Priority: Minor
>         Attachments: dedup.patch
>
>
> DeleteDuplicates.HashPartitioner.reduce():
> // byScore case
> if (value.score > highest.score) {
>   highest.keep = false;
>   LOG.debug("-discard " + highest + ", keep " + value);
>   output.collect(highest.url, highest);     // delete highest
>   highest = value;
> }
> // !byScore is also similar
> So assume two docs with the same hash are in values. If the first has a lower score than the
> second, then the first doc will be deleted. But if the first has a higher score than the second,
> then none will be deleted. AFAICS, there should be an else condition to delete value and keep
> highest as it is.
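
The else condition suggested above can be sketched in isolation as follows. This is a minimal illustration, not the contents of dedup.patch (which are not shown in this message). The class DedupSketch, the method selectDuplicatesToDelete, and the stripped-down IndexDoc are hypothetical stand-ins that keep only the fields used in the excerpt, and the returned list stands in for output.collect(). Deleting the later doc on equal scores is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DedupSketch {

    /** Hypothetical stand-in for IndexDoc: only the fields the excerpt uses. */
    static final class IndexDoc {
        final String url;
        final float score;
        boolean keep = true;

        IndexDoc(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    /**
     * Walks docs that share one content hash and returns those marked for
     * deletion. The returned list plays the role of output.collect().
     */
    static List<IndexDoc> selectDuplicatesToDelete(List<IndexDoc> sameHash) {
        List<IndexDoc> deleted = new ArrayList<>();
        IndexDoc highest = null;
        for (IndexDoc value : sameHash) {
            if (highest == null) {
                highest = value;          // first doc seen is the initial best
                continue;
            }
            if (value.score > highest.score) {
                highest.keep = false;     // previous best loses
                deleted.add(highest);
                highest = value;
            } else {
                value.keep = false;       // the missing branch: current doc loses
                deleted.add(value);       // (assumption: ties delete the later doc)
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        // Same two docs in both orders; "b" has the lower score each time.
        List<IndexDoc> highFirst = Arrays.asList(
                new IndexDoc("a", 2.0f), new IndexDoc("b", 1.0f));
        List<IndexDoc> lowFirst = Arrays.asList(
                new IndexDoc("b", 1.0f), new IndexDoc("a", 2.0f));

        System.out.println(selectDuplicatesToDelete(highFirst).get(0).url); // prints "b"
        System.out.println(selectDuplicatesToDelete(lowFirst).get(0).url);  // prints "b"
    }
}
```

With the else branch in place, exactly one of the two docs is discarded regardless of the order in which they arrive, which is the order independence the issue title asks for.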

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
