nutch-dev mailing list archives

From "Tien Nguyen Manh (JIRA)" <>
Subject [jira] [Created] (NUTCH-1693) TextMD5Signatue compute on textual content
Date Fri, 03 Jan 2014 04:25:50 GMT
Tien Nguyen Manh created NUTCH-1693:

             Summary: TextMD5Signatue compute on textual content
                 Key: NUTCH-1693
             Project: Nutch
          Issue Type: Bug
            Reporter: Tien Nguyen Manh
            Priority: Minor

I created a new MD5 signature that is computed on the textual content of a page. In our case
we use boilerpipe to extract the main text from the content, so this signature is more
effective for deduplication.
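
As a rough illustration of the idea only (this is not the patch attached to the issue), a text-based signature could be sketched as below. The class name TextSignatureSketch, its methods, and the whitespace normalization step are assumptions made for this example. The text extraction itself (e.g. by boilerpipe) is assumed to have happened already, and the sketch does not use any Nutch API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: hashes already-extracted main text.
class TextSignatureSketch {

    // Assumed normalization: trim and collapse whitespace runs so that
    // layout-only differences do not change the signature.
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // Returns the MD5 digest of the normalized text as a lowercase hex string.
    static String signature(String extractedText) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                normalize(extractedText).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

With a signature like this, two pages whose markup, navigation, or ads differ but whose extracted main text is the same would get the same value, which is what makes it useful for deduplication.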
