nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2628) Fetcher: optionally generate signature of unparsed content
Date Fri, 27 Jul 2018 13:43:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559774#comment-16559774 ]

ASF GitHub Bot commented on NUTCH-2628:
---------------------------------------

sebastian-nagel opened a new pull request #371: NUTCH-2628 Fetcher: optionally generate signature of unparsed content
URL: https://github.com/apache/nutch/pull/371
 
 
   - add property fetcher.signature to make fetcher generate a signature even if fetcher is not parsing
   - move comment about following meta-redirects for multi-parse ParseResults into the right position

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Fetcher: optionally generate signature of unparsed content
> ----------------------------------------------------------
>
>                 Key: NUTCH-2628
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2628
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
>
>
> Generating a document signature (MD5 digest) of the binary content currently requires that documents are parsed during the parse or fetch step. The signature is required for deduplication and for detection of unmodified content. It should therefore always be available, including in workflows that do not require documents to be parsed, e.g., because HTML content is exported to WARC files.
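
To illustrate why a signature of this kind does not depend on parsing, the following is a minimal sketch in Java that computes an MD5 digest directly over the fetched bytes. It is not the code from the pull request and not Nutch's signature implementation: the class name RawContentSignature and the methods signature and toHex are hypothetical and use only the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only (not part of Nutch).
public class RawContentSignature {

    // Computes the MD5 digest of the raw, unparsed content bytes.
    public static byte[] signature(byte[] rawContent) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return md5.digest(rawContent);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to support MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Renders a digest as a lowercase hexadecimal string.
    public static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the bytes of a fetched document.
        byte[] fetched = "<html><body>hello</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(signature(fetched)));
    }
}
```

Two fetches that return byte-identical content yield the same digest, which is the property that deduplication and detection of unmodified content rely on. Any change in the bytes, including markup-only changes, produces a different digest.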



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
