nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Closed: (NUTCH-415) Generate should mark selected records in crawlDB
Date Thu, 28 Dec 2006 00:10:22 GMT

Andrzej Bialecki  closed NUTCH-415.

    Fix Version/s:     (was: 0.8.2)
       Resolution: Fixed

Fixed in trunk, rev. 490607. Locking has been added, but generate/update can still be forced
to run against a locked DB with the "-force" command-line switch.
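
To illustrate the locking behaviour described above, here is a minimal sketch. It is not the code committed in rev. 490607: it uses plain java.nio on a local directory, and the class name CrawlDbLockSketch, the methods acquire/release, and the lock file name ".locked" are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration; not the Nutch implementation. */
final class CrawlDbLockSketch {
    // Assumed marker file name, chosen for this example only.
    static final String LOCK_NAME = ".locked";

    private CrawlDbLockSketch() {}

    /**
     * Creates a lock file inside dbDir. If the lock already exists, the call
     * fails unless force is true, in which case the existing lock is reused.
     */
    static Path acquire(Path dbDir, boolean force) throws IOException {
        Path lock = dbDir.resolve(LOCK_NAME);
        try {
            // createFile is atomic: it fails if the file already exists.
            return Files.createFile(lock);
        } catch (FileAlreadyExistsException e) {
            if (force) {
                return lock; // caller asked to proceed despite the lock
            }
            throw new IOException(
                "CrawlDB is locked: " + lock + " (use -force to override)", e);
        }
    }

    /** Removes the lock file; returns false if there was nothing to remove. */
    static boolean release(Path dbDir) throws IOException {
        return Files.deleteIfExists(dbDir.resolve(LOCK_NAME));
    }
}
```

In this sketch a job would call acquire before touching the DB and release in a finally block. Passing force=true corresponds to running with "-force".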

Generation time is recorded in the fetchlist, and optionally in CrawlDB. If a CrawlDatum in
CrawlDB carries this generation time, Generator checks whether generate.crawl.delay (7 days
by default) has elapsed since then, and only then includes the CrawlDatum in new fetchlists again.
During updatedb this marker value is removed from CrawlDB entries.
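
The marker logic above can be sketched roughly as follows. This is an illustration only, not the committed code: the entry's metadata is modelled as a plain Map, and the class name GenerateMarkerSketch, the key "generateTime", and the method names are hypothetical. Treating a delay that has exactly elapsed as eligible is also an assumption.

```java
import java.time.Duration;
import java.util.Map;

/** Hypothetical model of the generation-time marker; not the Nutch classes. */
final class GenerateMarkerSketch {
    // Default from the message above: 7 days.
    static final Duration DEFAULT_CRAWL_DELAY = Duration.ofDays(7);
    // Assumed metadata key, chosen for this example only.
    static final String GENERATE_TIME_KEY = "generateTime";

    private GenerateMarkerSketch() {}

    /** Generate step: record when the entry was placed on a fetchlist. */
    static void markGenerated(Map<String, Long> metadata, long nowMillis) {
        metadata.put(GENERATE_TIME_KEY, nowMillis);
    }

    /**
     * An entry may be selected again if it carries no marker, or if the
     * configured delay has elapsed since the marker was written.
     */
    static boolean isEligible(Map<String, Long> metadata, long nowMillis,
                              Duration crawlDelay) {
        Long generatedAt = metadata.get(GENERATE_TIME_KEY);
        if (generatedAt == null) {
            return true; // never generated, or marker cleared by updatedb
        }
        return nowMillis - generatedAt >= crawlDelay.toMillis();
    }

    /** Updatedb step: remove the marker so the entry is selectable again. */
    static void clearMarker(Map<String, Long> metadata) {
        metadata.remove(GENERATE_TIME_KEY);
    }
}
```

With this model, a second generate run shortly after the first skips every marked entry, so the two fetchlists differ. An entry becomes selectable again either after updatedb clears the marker or once the delay has passed.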

> Generate should mark selected records in crawlDB
> ------------------------------------------------
>                 Key: NUTCH-415
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8, 0.9.0, 0.8.1, 0.8.2
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
> In Nutch 0.7.x, if a user ran "generate" twice without an intervening "updatedb", each fetchlist
> would be different, because "generate" would mark selected entries as "being fetched" (by
> moving their fetch time one week forward).
> In Nutch 0.8 and later, crawldb is not modified at all during "generate". This means
> that two "generate" runs without an intervening "updatedb" will create exactly the same fetchlists,
> which is undesirable.
> I propose to re-implement this feature, using the same mechanism. The CrawlDB update would
> be performed simultaneously with the first mapred job in Generator, and the modified crawldb
> content would be produced together with an (unsorted) fetchlist in Selector, using a custom
> OutputFormat (patches to follow ;) ). Additionally, to ensure that the correct version of the modified
> crawldb is installed, I propose to add a locking mechanism that prevents two processes
> that modify crawldb from running simultaneously.

This message is automatically generated by JIRA.

