nutch-dev mailing list archives

From "Ferdy Galema (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1196) Update job should impose an upper limit on the number of inlinks (nutchgora)
Date Fri, 04 Nov 2011 15:55:51 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1196:
--------------------------------

    Attachment: NUTCH-1196.patch

Patch done. It applies db.update.max.inlinks just like Nutch trunk. Please note that the
property was already present in nutch-default.xml (but it was not doing anything yet).

The patch adds a class named UrlWithScore with an extensive test class. This class makes it
easy to integrate this way of sorting into Nutch trunk too. (Feel free to do so if anyone is
up for it.) An advantage of implementing it with Hadoop sorting is that all keys enter the
reducer in order. This makes way for the following improvement: removing the upper limit altogether
by using Iterator interfaces throughout and thereby eliminating memory-consuming collections
(bottlenecks). This requires changes to the ScoringFilter interface (change lists to iterators)
and possibly to Gora classes (allow partial puts to avoid big local buffers), but once that is
done we are able to completely remove this limit. After all, we don't like limits when talking
MapReduce, right? :) Anyway, this is just a suggestion for follow-up improvements. This issue
is simply for adding the inlinks limit.
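
To illustrate the iterator idea, here is a minimal sketch in plain Java. It is not code from the patch and does not use the real ScoringFilter interface: StreamingScoreSketch, sumScores and constantScores are hypothetical names, and summing scores is only a stand-in for whatever a scoring filter would actually compute. The point is that a consumer taking an Iterator can process any number of inlinks without holding them in a collection.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical illustration only; not part of NUTCH-1196.patch. */
public class StreamingScoreSketch {

    /**
     * Stand-in for a scoring step that accepts an Iterator instead of a List.
     * Each score is read once and discarded, so memory use stays constant.
     */
    public static float sumScores(Iterator<Float> inlinkScores) {
        float total = 0f;
        while (inlinkScores.hasNext()) {
            total += inlinkScores.next();
        }
        return total;
    }

    /**
     * Lazily produces n scores with the same value. This plays the role of the
     * already sorted value stream that a reducer would hand over.
     */
    public static Iterator<Float> constantScores(final int n, final float value) {
        return new Iterator<Float>() {
            private int produced = 0;

            @Override
            public boolean hasNext() {
                return produced < n;
            }

            @Override
            public Float next() {
                if (produced >= n) {
                    throw new NoSuchElementException();
                }
                produced++;
                return value;
            }
        };
    }
}
```

With a List-based signature the caller has to buffer every inlink first. With the Iterator-based one, sumScores(constantScores(1000000, 0.5f)) never allocates a collection at all.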

Feedback is much appreciated.
                
> Update job should impose an upper limit on the number of inlinks (nutchgora)
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-1196
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1196
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Ferdy Galema
>             Fix For: nutchgora
>
>         Attachments: NUTCH-1196.patch
>
>
> Currently the nutchgora branch does not limit the number of inlinks in the update job.
> This will result in some nasty out-of-memory exceptions and timeouts when the crawl is getting
> big. Nutch trunk already has a default limit of 10,000 inlinks. I will implement this in nutchgora
> too. Nutch trunk uses a sorting mechanism in the reducer itself, but I will implement it using
> standard Hadoop components instead (should be a bit faster). This means:
> The keys of the reducer will be a {url,score} tuple.
> *Partitioning* will be done by {url}.
> *Sorting* will be done by {url,score}.
> Finally *grouping* will be done by {url} again.
> This ensures all identical urls will be put in the same reducer, but in order of scoring.
> Patch should be ready by tomorrow. Please let me know if you have any comments or suggestions.
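
The partition/sort/group scheme described above can be simulated in plain Java without any Hadoop classes. This is only a sketch under stated assumptions, not the contents of the patch: SecondarySortSketch, UrlScoreKey, partition and reduce are hypothetical names, and sorting scores in descending order (so that the highest-scored inlinks survive the limit) is an assumption, since the description only says "in order of scoring".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation of the scheme; hypothetical, not taken from the patch. */
public class SecondarySortSketch {

    /** Hypothetical composite key: the {url,score} tuple. */
    public static final class UrlScoreKey {
        public final String url;
        public final float score;

        public UrlScoreKey(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    /** Partitioning by url only, so all keys of one url reach the same reducer. */
    public static int partition(UrlScoreKey key, int numReducers) {
        return (key.url.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Sorting by url first, then by score (descending is an assumption). */
    public static final Comparator<UrlScoreKey> SORT = (a, b) -> {
        int byUrl = a.url.compareTo(b.url);
        return byUrl != 0 ? byUrl : Float.compare(b.score, a.score);
    };

    /**
     * Simulates one reducer: sorts its keys, groups them by url, and keeps
     * at most maxInlinks scores per url. Because the keys arrive in order,
     * the kept scores are simply the first maxInlinks of each group.
     */
    public static Map<String, List<Float>> reduce(List<UrlScoreKey> keys, int maxInlinks) {
        List<UrlScoreKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);

        Map<String, List<Float>> keptPerUrl = new LinkedHashMap<>();
        for (UrlScoreKey key : sorted) {
            List<Float> kept = keptPerUrl.computeIfAbsent(key.url, u -> new ArrayList<>());
            if (kept.size() < maxInlinks) {
                kept.add(key.score);
            }
        }
        return keptPerUrl;
    }
}
```

In a real job the three roles would be filled by a partitioner, a sort comparator and a grouping comparator configured on the Hadoop Job, and the framework would do the sorting during the shuffle instead of the explicit sort call used here.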

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
