nutch-dev mailing list archives

From Tejas Patil <>
Subject Inject operation: can't it be done in a single map-reduce job ?
Date Sat, 04 Jan 2014 08:00:47 GMT
Hi nutch-dev,

I am looking at the Injector code in trunk and I see that we currently
launch two map-reduce jobs for it:
1. Sort job: read the URLs from the seeds file and emit CrawlDatum objects.
2. Merge job: read CrawlDatum objects from both the crawldb and the output
of the sort job, then merge them and emit the final CrawlDatum objects.
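To make the two-pass structure concrete, here is a minimal plain-Java
simulation of it (not the actual Injector code: datums are simplified to
status strings, the class and method names are hypothetical, and the rule
that an existing crawldb entry wins over a freshly injected one is an
assumption about the merge semantics):

```java
import java.util.*;

public class TwoJobInjectSketch {
    // Job 1 ("sort job"): seeds file -> freshly injected CrawlDatum records.
    // In the real Injector this output is materialized to disk and re-read
    // by the second job.
    public static Map<String, String> sortJob(List<String> seeds) {
        Map<String, String> out = new TreeMap<>();
        for (String url : seeds)
            out.put(url, "db_unfetched");
        return out;
    }

    // Job 2 ("merge job"): read both the crawldb and job 1's output and
    // merge them; an already-known URL keeps its existing datum (assumed rule).
    public static Map<String, String> mergeJob(Map<String, String> crawldb,
                                               Map<String, String> injected) {
        Map<String, String> merged = new TreeMap<>(injected);
        merged.putAll(crawldb); // existing crawldb entries overwrite injected ones
        return merged;
    }
}
```

The cost the email is pointing at is visible here: the output of the first
pass exists only to be re-read by the second.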

I realized that by using MultipleInputs, we can read CrawlDatum objects
from the crawldb and URLs from the seeds file simultaneously, and perform
the inject in a single map-reduce job. PFA an implementation of this
approach. I did some basic testing on it and so far I have not encountered
any problems.
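For illustration, the single-job flow can be sketched with plain Java
collections (again a simplified simulation, not the attached patch: in the
real job, MultipleInputs.addInputPath would register one mapper for the
seeds file and another for the crawldb, both emitting into one shuffle, and
the merge below would live in the reducer; the merge rule is an assumption):

```java
import java.util.*;

public class SingleJobInjectSketch {
    public static Map<String, String> inject(Map<String, String> crawldb,
                                             List<String> seeds) {
        // Map phase: both "mappers" emit (url, datum) pairs into one shuffle.
        List<String[]> shuffle = new ArrayList<>();
        for (Map.Entry<String, String> e : crawldb.entrySet())
            shuffle.add(new String[] { e.getKey(), e.getValue() }); // crawldb mapper
        for (String url : seeds)
            shuffle.add(new String[] { url, "db_unfetched" });      // seed mapper

        // Shuffle: group the emitted datums by URL.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : shuffle)
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);

        // Reduce: keep the existing crawldb datum when the URL is already
        // known, otherwise accept the freshly injected one (assumed rule).
        Map<String, String> merged = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            String keep = "db_unfetched";
            for (String d : e.getValue())
                if (!d.equals("db_unfetched"))
                    keep = d;
            merged.put(e.getKey(), keep);
        }
        return merged;
    }
}
```

The result is the same merged crawldb, but with no intermediate dataset
written and re-read between two jobs.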

I am not sure why Injector was not written this way, since it is more
efficient than the version currently in trunk (maybe MultipleInputs was
added to Hadoop later). I wonder whether I am wrong somewhere in my
understanding. Any comments on this?

