nutch-dev mailing list archives

From "Jorge Luis Betancourt Gonzalez (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script
Date Thu, 04 Jun 2015 12:51:38 GMT
Jorge Luis Betancourt Gonzalez created NUTCH-2036:
-----------------------------------------------------

             Summary: Adding some continuous crawl goodies to the crawl script
                 Key: NUTCH-2036
                 URL: https://issues.apache.org/jira/browse/NUTCH-2036
             Project: Nutch
          Issue Type: Improvement
          Components: bin, tool, util
    Affects Versions: 1.10, 1.11
            Reporter: Jorge Luis Betancourt Gonzalez
            Priority: Minor


Nutch does not support continuous crawling out of the box. It can be approximated with cron,
and for some crawls it is irrelevant because of their size, but it is still a nice feature
to have.

This patch adds a new option to the {{bin/crawl}} script (-w|--wait) that sets how long to
wait when the generator returns 0 (that is, when no URLs are scheduled for fetching).


The new parameter takes the {{NUMBER\[SUFFIX\]}} format. If no suffix is provided, the amount
of time is assumed to be in seconds. Valid suffixes are:

s - seconds
m - minutes
h - hours
d - days

If {{-1}} is passed to the parameter, or the parameter is not used at all, the default
behaviour of exiting the script applies.
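
To make the {{NUMBER\[SUFFIX\]}} rules concrete, here is a minimal sketch of how such a value could be converted to seconds in plain shell. It is not taken from the attached patch: the function name {{parse_wait}} and the error message are hypothetical, and the sketch assumes that {{-1}} (or a missing value) should be passed through unchanged so the caller can fall back to exiting.

```shell
# Hypothetical helper, not part of the NUTCH-2036 patch.
# Converts a NUMBER[SUFFIX] value into seconds and prints the result.
# Prints -1 unchanged, meaning "do not wait, exit as before".
parse_wait() {
  value="${1:--1}"

  # -1 (or no argument at all) keeps the default behaviour.
  if [ "$value" = "-1" ]; then
    echo "-1"
    return 0
  fi

  # Split off a trailing s, m, h or d; without a suffix, assume seconds.
  case "$value" in
    *[smhd]) suffix="${value#"${value%?}"}"; number="${value%?}" ;;
    *)       suffix="s"; number="$value" ;;
  esac

  # The numeric part must be a non-empty string of digits.
  case "$number" in
    ""|*[!0-9]*) echo "invalid wait value: $value" >&2; return 1 ;;
  esac

  # Drop leading zeros so the shell does not read the number as octal.
  while [ "${#number}" -gt 1 ] && [ "${number#0}" != "$number" ]; do
    number="${number#0}"
  done

  case "$suffix" in
    s) echo "$number" ;;
    m) echo $(( $number * 60 )) ;;
    h) echo $(( $number * 3600 )) ;;
    d) echo $(( $number * 86400 )) ;;
  esac
}
```

With this sketch, {{parse_wait 30}} prints 30, {{parse_wait 5m}} prints 300, and {{parse_wait 2h}} prints 7200. A caller could {{sleep}} for the printed number of seconds when the generator schedules nothing, and exit when the result is -1.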



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
