nutch-dev mailing list archives

From "David Johnson (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-2391) Spurious Duplications for MD5
Date Mon, 05 Jun 2017 18:09:04 GMT
David Johnson created NUTCH-2391:
------------------------------------

             Summary: Spurious Duplications for MD5
                 Key: NUTCH-2391
                 URL: https://issues.apache.org/jira/browse/NUTCH-2391
             Project: Nutch
          Issue Type: Bug
          Components: commoncrawl
    Affects Versions: 1.11
            Reporter: David Johnson
            Priority: Minor


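We're seeing a large number of documents in our crawl being incorrectly marked as duplicates.

We traced it back to one of the crawl plugins returning an empty array for the content field.
Because the existing check only handles null, every such document is hashed over the same
empty input and ends up with the same MD5 signature.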

We'd like to propose changing the MD5 signature generation from:
public byte[] calculate(Content content, Parse parse) {
  byte[] data = content.getContent();
  if (data == null)
    data = content.getUrl().getBytes();
  return MD5Hash.digest(data).getDigest();
}

to:
public byte[] calculate(Content content, Parse parse) {
  byte[] data = content.getContent();
  if ((data == null) || (data.length == 0))
    data = content.getUrl().getBytes();
  return MD5Hash.digest(data).getDigest();
}

so that empty content, like null content, falls back to hashing the URL.
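
For illustration only, here is a minimal standalone sketch of the two behaviours. It is not the
Nutch code: it uses java.security.MessageDigest in place of MD5Hash, takes the content bytes and
URL as plain arguments instead of Content and Parse, and the class name, method names and
example URLs are hypothetical. It also encodes the URL as UTF-8 explicitly, which is an
assumption (the snippets above use the platform default charset).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical sketch, not part of Nutch.
public class EmptyContentSignatureSketch {

    // Plain MD5 over a byte array; stands in for MD5Hash.digest(data).getDigest().
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Existing check: falls back to the URL only when the content is null.
    static byte[] currentSignature(byte[] content, String url) {
        byte[] data = content;
        if (data == null)
            data = url.getBytes(StandardCharsets.UTF_8); // assumption: UTF-8
        return md5(data);
    }

    // Proposed check: also falls back to the URL when the content is empty.
    static byte[] proposedSignature(byte[] content, String url) {
        byte[] data = content;
        if ((data == null) || (data.length == 0))
            data = url.getBytes(StandardCharsets.UTF_8); // assumption: UTF-8
        return md5(data);
    }

    public static void main(String[] args) {
        byte[] empty = new byte[0];
        String urlA = "http://example.com/a"; // placeholder URLs
        String urlB = "http://example.com/b";

        // Existing check: two different URLs with empty content collide.
        System.out.println(Arrays.equals(
            currentSignature(empty, urlA), currentSignature(empty, urlB))); // true

        // Proposed check: the signatures differ because the URLs differ.
        System.out.println(Arrays.equals(
            proposedSignature(empty, urlA), proposedSignature(empty, urlB))); // false
    }
}
```

With the existing check, every document with empty content gets the MD5 of zero bytes
(d41d8cd98f00b204e9800998ecf8427e), so all of them look like duplicates of each other.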



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
