nutch-dev mailing list archives

From "Jorge Luis Betancourt Gonzalez (JIRA)" <>
Subject [jira] [Updated] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script
Date Thu, 04 Jun 2015 12:52:37 GMT


Jorge Luis Betancourt Gonzalez updated NUTCH-2036:
    Attachment: NUTCH-2036.patch

> Adding some continuous crawl goodies to the crawl script
> --------------------------------------------------------
>                 Key: NUTCH-2036
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: bin, tool, util
>    Affects Versions: 1.10, 1.11
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Priority: Minor
>              Labels: crawl, script
>         Attachments: NUTCH-2036.patch
> Nutch does not support continuous crawling out of the box. This is doable with cron, and
is sometimes irrelevant given the size of the crawl, but it is still a nice feature
to have.
> This patch adds a new option to the {{bin/crawl}} script (-w|--wait) that sets how long
to wait when the generator returns 0 (when no URLs are scheduled for fetching).

> This new parameter has the {{NUMBER\[SUFFIX\]}} format. If no suffix is provided, the
amount of time is assumed to be in seconds. Valid suffixes are:
> s - seconds
> m - minutes
> h - hours
> d - days
> If a {{-1}} value is passed to the parameter, or it is not used at all, the default behaviour
of exiting the script is used.
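
To make the {{NUMBER\[SUFFIX\]}} format concrete, here is a minimal sketch of how such a value could be converted to seconds in POSIX shell. It is not taken from NUTCH-2036.patch: the function name wait_to_seconds and its error message are hypothetical, and the attached patch may parse the value differently.

```shell
# Hypothetical helper, not the code from NUTCH-2036.patch.
# Converts a NUMBER[SUFFIX] wait value into seconds and prints it.
# Prints -1 unchanged, since -1 means "exit instead of waiting".
# Variables are global because POSIX sh has no `local`.
wait_to_seconds() {
  value="$1"

  if [ "$value" = "-1" ]; then
    echo "-1"
    return 0
  fi

  # Split off an optional one-letter suffix; no suffix means seconds.
  case "$value" in
    *[smhd]) number="${value%?}"; suffix="${value#"$number"}" ;;
    *)       number="$value"; suffix="s" ;;
  esac

  # The numeric part must be a non-empty string of digits.
  case "$number" in
    ""|*[!0-9]*) echo "invalid wait value: $value" >&2; return 1 ;;
  esac

  # Strip leading zeros so the shell does not read the number as octal.
  while [ "${number#0}" != "$number" ] && [ "${#number}" -gt 1 ]; do
    number="${number#0}"
  done

  case "$suffix" in
    s) factor=1 ;;
    m) factor=60 ;;
    h) factor=3600 ;;
    d) factor=86400 ;;
  esac

  echo $(( number * factor ))
}
```

With a helper like this, a script could call sleep "$(wait_to_seconds 5m)" whenever the generator schedules no URLs, and skip the wait entirely when the result is -1.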

This message was sent by Atlassian JIRA
