nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2391) Spurious Duplications for MD5
Date Sat, 10 Jun 2017 06:47:21 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045421#comment-16045421 ]

ASF GitHub Bot commented on NUTCH-2391:
---------------------------------------

sebastian-nagel opened a new pull request #194: NUTCH-2391 Spurious Duplications for MD5
URL: https://github.com/apache/nutch/pull/194
 
 
   [NUTCH-2391](https://issues.apache.org/jira/browse/NUTCH-2391): use the URL for the MD5 digest as a fall-back if the content is empty, i.e., the content length is zero (contributed by David Johnson)
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Spurious Duplications for MD5
> -----------------------------
>
>                 Key: NUTCH-2391
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2391
>             Project: Nutch
>          Issue Type: Bug
>          Components: commoncrawl
>    Affects Versions: 1.11
>            Reporter: David Johnson
>            Priority: Minor
>             Fix For: 1.14
>
>
> We're seeing a large number of documents being marked as duplicates in our crawl.
> We traced it back to one of the crawl plugins returning an empty array for the content field.
> We'd like to propose changing the MD5 signature generation from:
> {code}
> public byte[] calculate(Content content, Parse parse) {
>   byte[] data = content.getContent();
>   if (data == null)
>     data = content.getUrl().getBytes();
>   return MD5Hash.digest(data).getDigest();
> }
> {code}
> to:
> {code}
> public byte[] calculate(Content content, Parse parse) {
>   byte[] data = content.getContent();
>   if ((data == null) || (data.length == 0))
>     data = content.getUrl().getBytes();
>   return MD5Hash.digest(data).getDigest();
> }
> {code}
> to address the issue.
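
For readers following the issue, here is a minimal, self-contained sketch of why an empty content array leads to spurious duplicates and how the proposed check avoids it. It is an illustration, not the Nutch code: the class and method names are hypothetical, the JDK's `MessageDigest` stands in for the `MD5Hash` call in the quoted snippet, and UTF-8 is assumed for the URL bytes (the quoted code uses the platform default charset).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical demo class; not part of Nutch.
public class EmptyContentSignatureSketch {

    // Plain JDK MD5, used here in place of MD5Hash.digest(data).getDigest().
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Mirrors the original logic: fall back to the URL only when content is null.
    static byte[] signatureNullCheckOnly(byte[] content, String url) {
        byte[] data = content;
        if (data == null)
            data = url.getBytes(StandardCharsets.UTF_8); // charset is an assumption
        return md5(data);
    }

    // Mirrors the proposed logic: also fall back when the content array is empty.
    static byte[] signatureWithEmptyCheck(byte[] content, String url) {
        byte[] data = content;
        if (data == null || data.length == 0)
            data = url.getBytes(StandardCharsets.UTF_8); // charset is an assumption
        return md5(data);
    }

    // Lower-case hex rendering of a digest, for printing and comparison.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes)
            sb.append(String.format("%02x", b & 0xff));
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] empty = new byte[0];
        String urlA = "http://example.com/a"; // placeholder URLs
        String urlB = "http://example.com/b";

        // Every empty document hashes zero bytes, so all signatures collide.
        System.out.println(Arrays.equals(
            signatureNullCheckOnly(empty, urlA),
            signatureNullCheckOnly(empty, urlB)));   // prints "true"

        // With the empty check, the signature is derived from the URL instead.
        System.out.println(Arrays.equals(
            signatureWithEmptyCheck(empty, urlA),
            signatureWithEmptyCheck(empty, urlB)));  // prints "false"
    }
}
```

With only the null check, every document whose content is an empty array gets the MD5 of zero bytes (`d41d8cd98f00b204e9800998ecf8427e`), so deduplication treats them all as copies of one another. Documents with non-empty content get the same signature under both variants.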



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
