nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-415) Generate should mark selected records in crawlDB
Date Fri, 15 Dec 2006 12:32:27 GMT
Generate should mark selected records in crawlDB
------------------------------------------------

                 Key: NUTCH-415
                 URL: http://issues.apache.org/jira/browse/NUTCH-415
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.8.1, 0.8, 0.8.2, 0.9.0
            Reporter: Andrzej Bialecki 
         Assigned To: Andrzej Bialecki 
             Fix For: 0.8.2, 0.9.0


In Nutch 0.7.x, if a user ran "generate" twice without an intervening "updatedb", the two
fetchlists would differ, because "generate" marked the selected entries as "being fetched"
(by moving their fetch time one week forward).

In Nutch 0.8 and later, crawldb is not modified at all during "generate". This means that
two "generate" runs without an intervening "updatedb" will create exactly the same fetchlists,
which is undesirable.
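
To make the difference concrete, here is a minimal sketch of the two behaviours. It is not
Nutch code: the crawldb is modelled as an in-memory sorted map from URL to next fetch time,
and the class name, method names, and sample URLs are all hypothetical. It also assumes that
"one week forward" is counted from the time of the generate run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for crawldb + generate; not the Nutch API.
class GenerateMarkingSketch {
    static final long ONE_WEEK_MS = 7L * 24 * 60 * 60 * 1000;

    /**
     * Selects up to topN URLs whose fetch time is due. When mark is true,
     * each selected entry is flagged as "being fetched" by moving its
     * fetch time one week past 'now' (assumption, see lead-in).
     */
    static List<String> generate(Map<String, Long> crawlDb, long now, int topN, boolean mark) {
        List<String> fetchlist = new ArrayList<>();
        for (Map.Entry<String, Long> entry : crawlDb.entrySet()) {
            if (fetchlist.size() >= topN) {
                break;
            }
            if (entry.getValue() <= now) {
                fetchlist.add(entry.getKey());
                if (mark) {
                    // Changing a value is not a structural modification,
                    // so it is safe while iterating over the TreeMap.
                    entry.setValue(now + ONE_WEEK_MS);
                }
            }
        }
        return fetchlist;
    }

    /** Four made-up URLs, all due for fetching at time 0. */
    static Map<String, Long> sampleDb() {
        Map<String, Long> db = new TreeMap<>();
        db.put("http://a.example/", 0L);
        db.put("http://b.example/", 0L);
        db.put("http://c.example/", 0L);
        db.put("http://d.example/", 0L);
        return db;
    }

    public static void main(String[] args) {
        long now = 1000L;

        // 0.8-style: crawldb untouched, so the second run repeats the first.
        Map<String, Long> unmarked = sampleDb();
        System.out.println(generate(unmarked, now, 2, false));
        System.out.println(generate(unmarked, now, 2, false));

        // 0.7-style: selected entries are marked, so the second run moves on.
        Map<String, Long> marked = sampleDb();
        System.out.println(generate(marked, now, 2, true));
        System.out.println(generate(marked, now, 2, true));
    }
}
```

Without marking, both runs pick the same two URLs; with marking, the second run skips the
entries already handed out and picks the next two.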

I propose to re-implement this feature, using the same mechanism. The CrawlDB update would be
performed simultaneously with the first mapred job in Generator, and the modified crawldb content
would be produced together with an (unsorted) fetchlist in Selector, using a custom OutputFormat
(patches to follow ;) ). Additionally, to ensure that the correct version of the modified crawldb
is installed, I propose to add a locking mechanism that prevents two processes that modify
crawldb from running simultaneously.
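
One possible shape for such a lock, as a rough sketch only: a marker file created atomically
inside the crawldb directory. This uses java.nio on a local filesystem as a stand-in for
whatever filesystem API the patch ends up using; the class name, method names, and the lock
file name are hypothetical.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical lock helper; not taken from the Nutch code base.
class CrawlDbLockSketch {
    // Assumed marker file name, chosen for illustration only.
    static final String LOCK_NAME = ".locked";

    /**
     * Tries to take the lock. Files.createFile checks for existence and
     * creates the file as one atomic operation, so only one caller wins.
     */
    static boolean tryLock(Path crawlDbDir) throws IOException {
        try {
            Files.createFile(crawlDbDir.resolve(LOCK_NAME));
            return true;
        } catch (FileAlreadyExistsException e) {
            return false; // another process is already modifying crawldb
        }
    }

    /** Releases the lock; does nothing if it was not held. */
    static void unlock(Path crawlDbDir) throws IOException {
        Files.deleteIfExists(crawlDbDir.resolve(LOCK_NAME));
    }

    public static void main(String[] args) throws IOException {
        Path crawlDb = Files.createTempDirectory("crawldb");

        System.out.println("first process got lock:  " + tryLock(crawlDb));
        System.out.println("second process got lock: " + tryLock(crawlDb));

        unlock(crawlDb);
        System.out.println("after unlock, got lock:  " + tryLock(crawlDb));

        // Clean up the temporary directory.
        unlock(crawlDb);
        Files.delete(crawlDb);
    }
}
```

A job that modifies crawldb would take the lock before writing, install the new crawldb, and
release the lock in a finally block. A lock left behind by a crashed job would need manual
removal or a force option, which this sketch does not cover.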

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
