nutch-dev mailing list archives

From "Ferdy Galema (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-1196) Update job should impose an upper limit on the number of inlinks (nutchgora)
Date Thu, 03 Nov 2011 17:33:33 GMT
Update job should impose an upper limit on the number of inlinks (nutchgora)
----------------------------------------------------------------------------

                 Key: NUTCH-1196
                 URL: https://issues.apache.org/jira/browse/NUTCH-1196
             Project: Nutch
          Issue Type: Bug
            Reporter: Ferdy Galema
             Fix For: nutchgora


Currently the nutchgora branch does not limit the number of inlinks in the update job. This
results in nasty out-of-memory exceptions and timeouts once the crawl gets big. Nutch trunk
already has a default limit of 10,000 inlinks, and I will implement this in nutchgora too.
Nutch trunk uses a sorting mechanism in the reducer itself, but I will implement it using
standard Hadoop components instead, which should be a bit faster. This means:

The reducer keys will be {url,score} tuples.

*Partitioning* will be done by {url}.
*Sorting* will be done by {url,score}.
Finally *grouping* will be done by {url} again.

This ensures that all identical URLs end up in the same reduce call, ordered by score (see the sketch below).
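For illustration, here is a minimal sketch of what this secondary-sort wiring could look like with the plain Hadoop MapReduce API. The class names (UrlScoreKey, UrlPartitioner, UrlGroupingComparator) and the descending score order are my own assumptions, not code from the actual patch:

{code:java}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

/** Composite reducer key: {url, score}. Illustrative only. */
public class UrlScoreKey implements WritableComparable<UrlScoreKey> {
  public Text url = new Text();
  public FloatWritable score = new FloatWritable();

  @Override
  public void write(DataOutput out) throws IOException {
    url.write(out);
    score.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    url.readFields(in);
    score.readFields(in);
  }

  /** Sorting: by url first, then by score (descending is assumed, so the best inlinks come first). */
  @Override
  public int compareTo(UrlScoreKey other) {
    int cmp = url.compareTo(other.url);
    return cmp != 0 ? cmp : -score.compareTo(other.score);
  }

  /** Partitioning: on {url} only, so every key for the same URL goes to the same reducer. */
  public static class UrlPartitioner extends Partitioner<UrlScoreKey, Text> {
    @Override
    public int getPartition(UrlScoreKey key, Text value, int numPartitions) {
      return (key.url.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  /** Grouping: on {url} only, so one reduce() call sees all inlinks of a URL, already sorted by score. */
  public static class UrlGroupingComparator extends WritableComparator {
    public UrlGroupingComparator() {
      super(UrlScoreKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      return ((UrlScoreKey) a).url.compareTo(((UrlScoreKey) b).url);
    }
  }

  /** Job wiring: partition by {url}, sort by {url,score}, group by {url}. */
  public static void configure(Job job) {
    job.setMapOutputKeyClass(UrlScoreKey.class);
    job.setPartitionerClass(UrlPartitioner.class);
    job.setGroupingComparatorClass(UrlGroupingComparator.class);
    // Sort order falls back to UrlScoreKey.compareTo(), i.e. {url,score}.
  }
}
{code}

Because the values for a URL then arrive best-score-first, capping the number of inlinks in the reducer becomes a matter of counting and breaking out of the loop.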

A patch should be ready by tomorrow. Please let me know if you have any comments or suggestions.
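As a companion to the sketch above, this is roughly how the 10,000-inlink cap could be enforced on the reduce side. The property name db.update.max.inlinks mirrors what trunk's nutch-default.xml uses, and the class is a hypothetical stand-in, not the actual update reducer:

{code:java}
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/** Hypothetical reducer that keeps only the best-scoring inlinks per URL. */
public class UpdateReducerSketch extends Reducer<UrlScoreKey, Text, Text, Text> {
  private int maxInlinks;

  @Override
  protected void setup(Context context) {
    // 10,000 is the default limit mentioned above; the property name is assumed from trunk.
    maxInlinks = context.getConfiguration().getInt("db.update.max.inlinks", 10000);
  }

  @Override
  protected void reduce(UrlScoreKey key, Iterable<Text> inlinks, Context context)
      throws IOException, InterruptedException {
    int kept = 0;
    for (Text inlink : inlinks) {
      if (++kept > maxInlinks) {
        break; // inlinks arrive best-score-first, so the rest can simply be dropped
      }
      // ... aggregate the inlink into the score update for this URL here ...
    }
    context.write(key.url, new Text(String.valueOf(kept))); // placeholder output
  }
}
{code}

Dropping the tail this way avoids buffering all inlinks in memory, which is what causes the out-of-memory problems described above.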

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
