nutch-dev mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Date Fri, 15 Jun 2007 07:58:26 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505079 ]

Enis Soztutar commented on NUTCH-498:
-------------------------------------

I think you may not want 
{code} 
reporter.incrCounter(Counters.COMBINED, combined); 
{code}

which increments the counter by the total count so far, but rather you may use 
{code} 
reporter.incrCounter(Counters.COMBINED, 1); 
{code}
for each URL combined.
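
To make the difference concrete, here is a minimal sketch that does not depend on Hadoop. The class `CombinedCounterSketch` and its `incrCounter` method are hypothetical stand-ins for `Reporter.incrCounter`, not part of the patch; they only accumulate a number. It compares adding the running total on every iteration with adding 1 per merged value, and also shows that adding the total once after the loop gives the same result as adding 1 per value.

```java
// Hypothetical stand-in for Hadoop's Reporter: it only accumulates one counter.
class CombinedCounterSketch {
    static long counter = 0;

    // Plays the role of reporter.incrCounter(Counters.COMBINED, amount).
    static void incrCounter(long amount) {
        counter += amount;
    }

    // Adds the running total on every iteration: yields 1 + 2 + ... + n.
    static long runningTotalInLoop(int mergedValues) {
        counter = 0;
        int combined = 0;
        for (int i = 0; i < mergedValues; i++) {
            combined++;
            incrCounter(combined);
        }
        return counter;
    }

    // Adds 1 for each merged value: yields n.
    static long onePerValue(int mergedValues) {
        counter = 0;
        for (int i = 0; i < mergedValues; i++) {
            incrCounter(1);
        }
        return counter;
    }

    // Adds the total once, after the loop: also yields n.
    static long totalOnceAfterLoop(int mergedValues) {
        counter = 0;
        int combined = 0;
        for (int i = 0; i < mergedValues; i++) {
            combined++;
        }
        incrCounter(combined);
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(runningTotalInLoop(4)); // prints 10
        System.out.println(onePerValue(4));        // prints 4
        System.out.println(totalOnceAfterLoop(4)); // prints 4
    }
}
```

So the counter is only inflated when the running total is added more than once for the same key.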

Could you attach the patch against the current trunk, so that we can apply it directly?


> Use Combiner in LinkDb to increase speed of linkdb generation
> -------------------------------------------------------------
>
>                 Key: NUTCH-498
>                 URL: https://issues.apache.org/jira/browse/NUTCH-498
>             Project: Nutch
>          Issue Type: Improvement
>          Components: linkdb
>    Affects Versions: 0.9.0
>            Reporter: Espen Amble Kolstad
>            Priority: Minor
>
> I tried to add the following combiner to LinkDb:
>    public static enum Counters {COMBINED}
>    public static class LinkDbCombiner extends MapReduceBase implements Reducer {
>       private int _maxInlinks;
>       @Override
>       public void configure(JobConf job) {
>          super.configure(job);
>          _maxInlinks = job.getInt("db.max.inlinks", 10000);
>       }
>       public void reduce(WritableComparable key, Iterator values, OutputCollector output,
>             Reporter reporter) throws IOException {
>             final Inlinks inlinks = (Inlinks) values.next();
>             int combined = 0;
>             while (values.hasNext()) {
>                Inlinks val = (Inlinks) values.next();
>                for (Iterator it = val.iterator(); it.hasNext();) {
>                   if (inlinks.size() >= _maxInlinks) {
>                      if (combined > 0) {
>                         reporter.incrCounter(Counters.COMBINED, combined);
>                      }
>                      output.collect(key, inlinks);
>                      return;
>                   }
>                   Inlink in = (Inlink) it.next();
>                   inlinks.add(in);
>                }
>                combined++;
>             }
>             if (inlinks.size() == 0) {
>                return;
>             }
>             if (combined > 0) {
>                reporter.incrCounter(Counters.COMBINED, combined);
>             }
>             output.collect(key, inlinks);
>       }
>    }
> This greatly reduced the time it took to generate a new linkdb. In my case it reduced
> the time by half.
> Map output records        8717810541
> Combined                  7632541507
> Resulting output records  1085269034
> That's an 87% reduction of output records from the map phase.
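
The figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: the class and method names (`ReductionRatioSketch`, `remaining`, `reductionPercent`) are made up for illustration, and the inputs are the two counter values from the issue description.

```java
// Recomputes the record counts quoted in the issue description.
class ReductionRatioSketch {
    // Records left after the combiner has merged `combined` of them.
    static long remaining(long mapOutput, long combined) {
        return mapOutput - combined;
    }

    // Share of map output records removed by the combiner, in percent.
    static double reductionPercent(long mapOutput, long combined) {
        return 100.0 * combined / mapOutput;
    }

    public static void main(String[] args) {
        long mapOutput = 8717810541L; // "Map output records"
        long combined = 7632541507L;  // "Combined"

        System.out.println(remaining(mapOutput, combined)); // prints 1085269034
        System.out.printf("%.2f%%%n", reductionPercent(mapOutput, combined));
    }
}
```

The difference matches the reported number of resulting output records, and the ratio comes out a little over 87.5%, consistent with the reduction stated in the description.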

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

