nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script
Date Thu, 04 Jun 2015 14:11:38 GMT


Julien Nioche commented on NUTCH-2036:


Note that this patch also handles the case where the number of rounds is set to -1, in which
case the crawl never stops. This would often be used in combination with the brand new
'wait' parameter.
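
To illustrate how an unlimited number of rounds and a wait time could interact, here is a
minimal sketch. It is not the code from NUTCH-2036.patch: the function name run_rounds, its
argument order, and the convention that the round command fails when nothing is scheduled
are all assumptions made for this example.

```shell
# Hypothetical sketch, not taken from the patch.
# Usage: run_rounds LIMIT WAIT_SECONDS COMMAND
#   LIMIT         number of rounds, or -1 for no limit
#   WAIT_SECONDS  seconds to sleep when a round finds nothing, or -1 to stop
#   COMMAND       runs one round; assumed to return non-zero when
#                 no URLs are scheduled for fetching
run_rounds() {
  rr_limit=$1
  rr_wait=$2
  rr_cmd=$3
  rr_round=0
  # With a limit of -1 the first test is always true, so the loop only
  # ends through the break below.
  while [ "$rr_limit" -eq -1 ] || [ "$rr_round" -lt "$rr_limit" ]; do
    rr_round=$(( rr_round + 1 ))
    if ! "$rr_cmd" "$rr_round"; then
      if [ "$rr_wait" -lt 0 ]; then
        break              # no wait configured: stop, as before the patch
      fi
      sleep "$rr_wait"     # wait, then try another round
    fi
  done
}
```

In this sketch a round that finds nothing still counts towards LIMIT, which is a design
choice of the example. With LIMIT set to -1 and a non-negative WAIT_SECONDS the loop runs
until the process is stopped externally.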

> Adding some continuous crawl goodies to the crawl script
> --------------------------------------------------------
>                 Key: NUTCH-2036
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: bin, tool, util
>    Affects Versions: 1.10, 1.11
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Priority: Minor
>              Labels: crawl, script
>         Attachments: NUTCH-2036.patch
> Nutch does not support continuous crawling out of the box. It is somewhat doable using
cron, and is sometimes irrelevant due to the size of the crawl, but it is a nice feature
to have.
> This patch adds a new option to the {{bin/crawl}} script (-w|--wait) that sets a time
to wait if the generator returns 0 (when no URLs are scheduled for fetching).

> This new parameter has the {{NUMBER\[SUFFIX\]}} format. If no suffix is provided, the
amount of time is assumed to be in seconds. Valid suffixes are:
> s - seconds
> m - minutes
> h - hours
> d - days
> If a {{-1}} value is passed to the parameter, or if it is not used at all, the default
behaviour of exiting the script is used.
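
The NUMBER[SUFFIX] format described above could be converted to seconds along these lines.
This is a sketch under stated assumptions, not the implementation in the attached patch:
the function name to_seconds is hypothetical, and rejecting anything other than digits plus
one optional suffix (including -1, which a caller would check for separately) is a choice
made for this example.

```shell
# Hypothetical helper, not taken from the patch.
# Prints the number of seconds for a NUMBER[SUFFIX] value, where SUFFIX
# is s, m, h or d and defaults to seconds when omitted.
to_seconds() {
  ts_number=${1%[smhd]}        # drop one trailing suffix letter, if any
  ts_suffix=${1#"$ts_number"}  # the letter that was dropped, or empty
  case "$ts_number" in
    ''|*[!0-9]*)
      echo "invalid wait value: $1" >&2
      return 1
      ;;
  esac
  # Strip leading zeros so the shell does not read the number as octal.
  while [ "${#ts_number}" -gt 1 ] && [ "${ts_number#0}" != "$ts_number" ]; do
    ts_number=${ts_number#0}
  done
  case "$ts_suffix" in
    ''|s) ts_factor=1 ;;
    m)    ts_factor=60 ;;
    h)    ts_factor=3600 ;;
    d)    ts_factor=86400 ;;
  esac
  echo $(( ts_number * ts_factor ))
}
```

For example, `sleep "$(to_seconds 5m)"` would pause for five minutes.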
