nutch-dev mailing list archives
From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
Date Sat, 16 Jun 2007 11:03:26 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505454 ]

Doğacan Güney commented on NUTCH-498:
-------------------------------------

> Currently there is no difference, indeed. The version in LinkDb.reduce is safer, because
> it uses a separate instance of Inlinks. Perhaps we could replace LinkDb.Merger.reduce
> with the body of LinkDb.reduce, and completely remove LinkDb.reduce.

Sounds good. I opened NUTCH-499 for this.
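For context, the merge-and-cap behaviour of the combiner quoted below can be sketched with plain Java collections. This is a minimal illustration only, not Nutch's actual Inlinks/Reducer API: the `combine` method and the `List<String>` stand-in for Inlinks are hypothetical, and the cap plays the role of `db.max.inlinks`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CombinerSketch {
    // Hypothetical stand-in for Nutch's Inlinks: each List<String> is one
    // partial inlink list emitted by a map task for the same key.
    static List<String> combine(Iterator<List<String>> values, int maxInlinks) {
        // Start from the first partial list, as the patch starts from the
        // first Inlinks instance.
        List<String> merged = new ArrayList<>(values.next());
        while (values.hasNext()) {
            for (String inlink : values.next()) {
                if (merged.size() >= maxInlinks) {
                    return merged; // cap reached: stop merging early
                }
                merged.add(inlink);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> partials = List.of(
                List.of("a", "b"), List.of("c", "d"), List.of("e"));
        // With a cap of 3, merging stops as soon as three inlinks are kept.
        System.out.println(combine(partials.iterator(), 3)); // prints [a, b, c]
    }
}
```

The point of running this logic as a combiner is that partial lists are merged on the map side, so far fewer records cross the network to the reducers.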

> Use Combiner in LinkDb to increase speed of linkdb generation
> -------------------------------------------------------------
>
>                 Key: NUTCH-498
>                 URL: https://issues.apache.org/jira/browse/NUTCH-498
>             Project: Nutch
>          Issue Type: Improvement
>          Components: linkdb
>    Affects Versions: 0.9.0
>            Reporter: Espen Amble Kolstad
>            Priority: Minor
>         Attachments: LinkDbCombiner.patch, LinkDbCombiner.patch
>
>
> I tried to add the following combiner to LinkDb:
>    public static enum Counters {COMBINED}
>    public static class LinkDbCombiner extends MapReduceBase implements Reducer {
>       private int _maxInlinks;
>       @Override
>       public void configure(JobConf job) {
>          super.configure(job);
>          _maxInlinks = job.getInt("db.max.inlinks", 10000);
>       }
>       public void reduce(WritableComparable key, Iterator values,
>             OutputCollector output, Reporter reporter) throws IOException {
>             final Inlinks inlinks = (Inlinks) values.next();
>             int combined = 0;
>             while (values.hasNext()) {
>                Inlinks val = (Inlinks) values.next();
>                for (Iterator it = val.iterator(); it.hasNext();) {
>                   if (inlinks.size() >= _maxInlinks) {
>                      if (combined > 0) {
>                         reporter.incrCounter(Counters.COMBINED, combined);
>                      }
>                      output.collect(key, inlinks);
>                      return;
>                   }
>                   Inlink in = (Inlink) it.next();
>                   inlinks.add(in);
>                }
>                combined++;
>             }
>             if (inlinks.size() == 0) {
>                return;
>             }
>             if (combined > 0) {
>                reporter.incrCounter(Counters.COMBINED, combined);
>             }
>             output.collect(key, inlinks);
>       }
>    }
> This greatly reduced the time it took to generate a new linkdb. In my case it reduced the time by half.
> Map output records:       8,717,810,541
> Combined:                 7,632,541,507
> Resulting output records: 1,085,269,034
> That's an 87% reduction of output records from the map phase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

