nutch-dev mailing list archives

From "Jorge Luis Betancourt Gonzalez (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script
Date Thu, 04 Jun 2015 12:52:37 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2036:
--------------------------------------------------
    Attachment: NUTCH-2036.patch

> Adding some continuous crawl goodies to the crawl script
> --------------------------------------------------------
>
>                 Key: NUTCH-2036
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2036
>             Project: Nutch
>          Issue Type: Improvement
>          Components: bin, tool, util
>    Affects Versions: 1.10, 1.11
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Priority: Minor
>              Labels: crawl, script
>         Attachments: NUTCH-2036.patch
>
>
> Nutch does not support continuous crawling out of the box. It is doable using cron, and
sometimes it is irrelevant due to the size of the crawl, but it's a nice feature to have.
> This patch adds a new option to the {{bin/crawl}} script (-w|--wait) that sets how long
the script waits if the generator returns 0 (when no URLs are scheduled for fetching).

> This new parameter has the {{NUMBER\[SUFFIX\]}} format. If no suffix is provided, the
amount of time is assumed to be in seconds. Valid suffixes are:
> s - seconds
> m - minutes
> h - hours
> d - days
> If {{-1}} is passed to the parameter, or it's not used at all, the default behaviour
of exiting the script is used.
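
For illustration only, the NUMBER[SUFFIX] handling described above could be sketched as a small shell function. This is not taken from NUTCH-2036.patch: the function name wait_to_seconds and its error message are hypothetical, and the attached patch may parse the value differently.

```shell
# Hypothetical helper, not taken from NUTCH-2036.patch.
# Prints the number of seconds for a NUMBER[SUFFIX] value.
# A value of -1 is printed unchanged (meaning "exit instead of waiting").
wait_to_seconds() {
  local value="$1" number suffix
  if [ "$value" = "-1" ]; then
    echo "-1"
    return 0
  fi
  # Split off a trailing s/m/h/d suffix; with no suffix, assume seconds.
  case "$value" in
    *[smhd]) number="${value%?}"; suffix="${value#"$number"}" ;;
    *)       number="$value"; suffix="s" ;;
  esac
  # Reject anything whose numeric part is empty or contains non-digits.
  case "$number" in
    ""|*[!0-9]*) echo "invalid wait value: $value" >&2; return 1 ;;
  esac
  # Strip leading zeros so the shell does not read the number as octal.
  while [ "${#number}" -gt 1 ] && [ "${number#0}" != "$number" ]; do
    number="${number#0}"
  done
  case "$suffix" in
    s) echo "$number" ;;
    m) echo $(( number * 60 )) ;;
    h) echo $(( number * 3600 )) ;;
    d) echo $(( number * 86400 )) ;;
  esac
}
```

A caller could then do something like sleep "$(wait_to_seconds 5m)" when the generator reports nothing to fetch, and exit when the result is -1. GNU sleep accepts the same s, m, h and d suffixes directly, so the conversion mainly matters where only a plain number of seconds is accepted.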



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
