nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2491) Integrate sitemap processing and HostDB into crawl script
Date Wed, 03 Jan 2018 14:18:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309702#comment-16309702 ]

ASF GitHub Bot commented on NUTCH-2491:
---------------------------------------

mfeltscher opened a new pull request #270: NUTCH-2491: Integrate sitemap processing and HostDB into crawl script
URL: https://github.com/apache/nutch/pull/270

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Integrate sitemap processing and HostDB into crawl script
> ---------------------------------------------------------
>
>                 Key: NUTCH-2491
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2491
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Moreno Feltscher
>            Assignee: Moreno Feltscher
>            Priority: Minor
>
> Add three new steps to the crawl bash script:
> 1. Generate HostDB from CrawlDB
> 2. Inject URLs from the sitemaps found for the hosts in the HostDB
> 3. If given, inject sitemap URLs specified in one or more configuration files
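
The three steps in the issue description above could look roughly like this in a
bash script. This is only a sketch and not the code from pull request #270: the
function name sitemap_steps, the NUTCH variable and the directory layout
(crawldb/ and hostdb/ under one crawl directory) are assumptions made for
illustration. The updatehostdb and sitemap commands and their options are
written as I understand them from Nutch 1.x, so check them against the usage
output of your installed version.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the three proposed steps; NOT the code from PR #270.
# NUTCH, sitemap_steps and the directory layout are illustrative assumptions.

NUTCH="${NUTCH:-bin/nutch}"

sitemap_steps() {
  local crawl_path="$1"       # directory assumed to hold crawldb/ and hostdb/
  local sitemap_dir="${2:-}"  # optional directory of files listing sitemap URLs

  # 1. Generate (or update) the HostDB from the CrawlDB.
  "$NUTCH" updatehostdb -crawldb "$crawl_path/crawldb" \
                        -hostdb "$crawl_path/hostdb" || return 1

  # 2. Inject URLs from sitemaps found for the hosts in the HostDB.
  "$NUTCH" sitemap "$crawl_path/crawldb" -hostdb "$crawl_path/hostdb" || return 1

  # 3. If given, inject sitemap URLs listed in the configured file(s).
  if [ -n "$sitemap_dir" ]; then
    "$NUTCH" sitemap "$crawl_path/crawldb" -sitemapUrls "$sitemap_dir" || return 1
  fi
}
```

Calling `sitemap_steps crawl` would run only the first two steps, while
`sitemap_steps crawl seeds/sitemaps` would also run the third. Setting
NUTCH=echo beforehand prints the commands instead of running them, which is a
cheap way to review what would be executed.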



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
