nutch-dev mailing list archives

From "Michael Chan (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-707) Generation of multiple segments in multiple runs returns only 1 segment
Date Sat, 28 Feb 2009 17:42:12 GMT
Generation of multiple segments in multiple runs returns only 1 segment
-----------------------------------------------------------------------

                 Key: NUTCH-707
                 URL: https://issues.apache.org/jira/browse/NUTCH-707
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 0.9.0
         Environment: Ubuntu Hardy (8.04), Java 1.5.0 64b.
            Reporter: Michael Chan
             Fix For: 0.9.0


To generate multiple segments over multiple runs, generator.update.crawldb is set to true and
-topN is set to the desired segment size. However, only one segment of size N is generated.

For example, I tried it with a db containing 10,000+ links according to a dump. With generator.update.crawldb
set to true and -topN set to 5, only 1 segment of size 5 is produced.

It seems to me the problem is that the generation time is recorded incorrectly. Selector.map
assigns the generation time to every URL it emits, even though reduce only collects N of them.
This is harmless if the generator is run once and the db isn't updated. But when the
generator is run again within genDelay, all the remaining URLs are excluded. So I suggest
that the generation time be assigned in reduce rather than in map.
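
The effect described above can be reproduced with a small toy model. This is only a sketch of the reasoning, not Nutch's actual Generator code: the function names, the `stamp_in_map` switch, the in-memory dict standing in for the crawldb, and the delay value are all hypothetical and chosen for illustration.

```python
# Toy model of the generator's select step (NOT Nutch code).
# GEN_DELAY is an assumed value in seconds, used only for this illustration.
GEN_DELAY = 7 * 24 * 3600


def generate(crawldb, top_n, now, stamp_in_map):
    """Simulate one generator run and return the URLs selected for the segment.

    crawldb maps each URL to its last recorded generation time (or None).
    """
    # "map" phase: skip URLs whose generation time is still within GEN_DELAY.
    candidates = [
        url for url, gen_time in crawldb.items()
        if gen_time is None or now - gen_time < 0 or now - gen_time >= GEN_DELAY
    ]
    if stamp_in_map:
        # Behaviour described in the report: every candidate is stamped,
        # although only top_n of them will end up in the segment.
        for url in candidates:
            crawldb[url] = now

    # "reduce" phase: keep only the first top_n candidates.
    selected = candidates[:top_n]
    if not stamp_in_map:
        # Suggested behaviour: stamp only the URLs that were actually selected.
        for url in selected:
            crawldb[url] = now
    return selected


def run_twice(stamp_in_map):
    """Run the toy generator twice within GEN_DELAY; return both segment sizes."""
    crawldb = {f"http://example.com/{i}": None for i in range(20)}
    first = generate(crawldb, 5, now=0, stamp_in_map=stamp_in_map)
    second = generate(crawldb, 5, now=60, stamp_in_map=stamp_in_map)
    return len(first), len(second)
```

Under these assumptions, `run_twice(True)` leaves nothing for the second run because all 20 URLs were stamped in the first one, while `run_twice(False)` yields a second segment of 5 fresh URLs.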

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

