nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2408) CrawlDb: allow update from unparsed segments
Date Sat, 12 Aug 2017 14:24:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124584#comment-16124584 ]

ASF GitHub Bot commented on NUTCH-2408:
---------------------------------------

sebastian-nagel opened a new pull request #212: NUTCH-2408 CrawlDb: allow update from unparsed
segments
URL: https://github.com/apache/nutch/pull/212
 
 
   - use unparsed segments to update status in CrawlDb
   - but log when an unparsed segment is detected (previously, unparsed segments were logged and skipped)
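
The behaviour change listed above can be illustrated with a small sketch. This is not the code from the pull request: the function name `segments_to_use`, the `skip_unparsed` flag, and the in-memory representation of segments are all hypothetical, and "unparsed" is assumed here to mean that a segment has no crawl_parse subdirectory.

```python
import logging
from typing import Dict, List, Set

log = logging.getLogger("updatedb_sketch")


def segments_to_use(segments: Dict[str, Set[str]], skip_unparsed: bool) -> List[str]:
    """Return the names of the segments that would be passed on to the update.

    `segments` maps a segment name to the set of subdirectory names it
    contains (hypothetical representation, for illustration only).
    """
    selected = []
    for name, subdirs in segments.items():
        # Assumption: a segment without crawl_parse counts as unparsed.
        if "crawl_parse" not in subdirs:
            log.warning("Segment %s is not parsed", name)
            if skip_unparsed:
                # Old behaviour: log the unparsed segment and skip it.
                continue
        # New behaviour: the unparsed segment is logged but still used.
        selected.append(name)
    return selected
```

With `skip_unparsed=True` the sketch mimics the earlier "log and skip" handling; with `skip_unparsed=False` it mimics the "log and use" handling described in the pull request.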
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> CrawlDb: allow update from unparsed segments
> --------------------------------------------
>
>                 Key: NUTCH-2408
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2408
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.13
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.14
>
>
> The command updatedb (class o.a.n.crawl.CrawlDb) does not allow updating the CrawlDb
with fetch status only (from the segment subdirectory crawl_fetch) without also reading crawl_parse
(which contains outlinks as well as scores, signatures and metadata). 
> A workflow which does not require parsing of documents (e.g., because raw HTML content
is exported to WARC files) is then unable to update the CrawlDb to store the fetch status.
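
As a rough illustration of the issue described above, the following sketch lists which of the two segment subdirectories named in the description (crawl_fetch and crawl_parse) actually exist for a given segment. It is only an assumption-laden sketch and not Nutch code: the function name `updatedb_input_dirs` is hypothetical, and segments are assumed to be directories on the local filesystem.

```python
from pathlib import Path
from typing import List


def updatedb_input_dirs(segment: Path) -> List[Path]:
    """Return the existing update inputs of a segment directory.

    Hypothetical helper: it only checks which of the subdirectories
    crawl_fetch and crawl_parse are present on the local filesystem.
    """
    candidates = ["crawl_fetch", "crawl_parse"]
    return [segment / name for name in candidates if (segment / name).is_dir()]
```

For a segment from a fetch-only workflow (for example one that exports raw HTML to WARC files without parsing), this would return only the crawl_fetch path, which is the case the issue asks updatedb to accept.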



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
