nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script
Date Thu, 25 Jun 2015 10:47:04 GMT


Markus Jelsma commented on NUTCH-2036:

Seems fine to me :)

> Adding some continuous crawl goodies to the crawl script
> --------------------------------------------------------
>                 Key: NUTCH-2036
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: bin, tool, util
>    Affects Versions: 1.10
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Priority: Minor
>              Labels: crawl, script
>             Fix For: 1.11
>         Attachments: NUTCH-2036.patch
> Nutch does not support continuous crawling out of the box. It is somewhat doable
using cron, and sometimes irrelevant due to the size of the crawl, but it is a nice feature
to have.
> This patch adds a new option to the {{bin/crawl}} script (-w|--wait) that sets how long
to wait when the generator returns 0 (that is, when no URLs are scheduled for fetching).

> This new parameter has the {{NUMBER\[SUFFIX\]}} format. If no suffix is provided, the
amount of time is assumed to be in seconds. Valid suffixes are:
> s - seconds
> m - minutes
> h - hours
> d - days
> If a {{-1}} value is passed to the parameter, or it is not used at all, the default behaviour
of exiting the script is used.
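
For illustration, the {{NUMBER\[SUFFIX\]}} rule described above could be sketched as a small shell function. This is not taken from NUTCH-2036.patch: the function name to_seconds and its error message are hypothetical, and the sketch only assumes the suffix meanings listed in the description (s, m, h, d, with seconds as the default) and that -1 means "do not wait".

```shell
# Hypothetical helper, not the code from NUTCH-2036.patch.
# Converts a NUMBER[SUFFIX] wait value into seconds.
# Prints -1 unchanged, which the description uses to mean "exit instead of waiting".
# Variables are not declared local, to stay within POSIX sh.
to_seconds() {
  value="$1"

  if [ "$value" = "-1" ]; then
    echo "-1"
    return 0
  fi

  # Split off a trailing suffix if present; default to seconds.
  case "$value" in
    *[smhd]) number="${value%?}"; suffix="${value#"$number"}" ;;
    *)       number="$value";     suffix="s" ;;
  esac

  # Reject empty or non-numeric amounts such as "m", "5x" or "-5".
  case "$number" in
    ""|*[!0-9]*) echo "invalid wait value: $value" >&2; return 1 ;;
  esac

  # Strip leading zeros so the shell does not read the number as octal.
  while [ "${#number}" -gt 1 ] && [ "${number#0}" != "$number" ]; do
    number="${number#0}"
  done

  case "$suffix" in
    s) factor=1 ;;
    m) factor=60 ;;
    h) factor=3600 ;;
    d) factor=86400 ;;
  esac

  echo $(( number * factor ))
}
```

With this sketch, a call such as to_seconds 5m prints 300 and to_seconds 90 prints 90. A crawl loop could pass the result to sleep when the generator schedules no URLs, and exit as before when the result is -1.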

This message was sent by Atlassian JIRA
