nutch-dev mailing list archives

From Lewis John Mcgibbney <>
Subject Re: Inject operation: can't it be done in a single map-reduce job ?
Date Mon, 06 Jan 2014 14:39:36 GMT
Hi Tejas,

On Sat, Jan 4, 2014 at 8:01 AM, <> wrote:

> I realized that by using MultipleInputs, we can read CrawlDatum objects
> from the crawldb and URLs from the seeds file simultaneously, and perform
> inject in a single map-reduce job. PFA an implementation of this approach.
> I did some basic testing on it and so far I have not encountered any
> problems.

Dynamite, Tejas. I would kindly ask that you open an issue and apply your
patch against trunk :)
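For readers following along, here is a plain-Java sketch (no Hadoop dependencies) of the merge such a single-job Injector's reducer would perform once MultipleInputs delivers crawldb records and seed URLs keyed by URL into the same reduce. The class name `InjectMerge`, the simplified `Datum` type, and the status strings are illustrative assumptions, not Nutch's actual API:

```java
import java.util.HashMap;
import java.util.Map;

public class InjectMerge {

    // Simplified stand-in for Nutch's CrawlDatum: just a status and a score.
    static final class Datum {
        final String status; // e.g. "db_fetched", "injected"
        final float score;
        Datum(String status, float score) { this.status = status; this.score = score; }
    }

    // With MultipleInputs, records from the existing crawldb and from the
    // seed list arrive keyed by URL in a single reduce. An existing crawldb
    // entry takes precedence over a freshly injected one, mirroring the
    // merge step of the two-job Injector.
    static Map<String, Datum> merge(Map<String, Datum> crawldb, Map<String, Datum> seeds) {
        Map<String, Datum> out = new HashMap<>(crawldb);
        for (Map.Entry<String, Datum> e : seeds.entrySet()) {
            out.putIfAbsent(e.getKey(), e.getValue()); // keep the old datum if present
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Datum> crawldb = new HashMap<>();
        crawldb.put("http://example.org/", new Datum("db_fetched", 1.5f));

        Map<String, Datum> seeds = new HashMap<>();
        seeds.put("http://example.org/", new Datum("injected", 1.0f)); // already known
        seeds.put("http://example.com/", new Datum("injected", 1.0f)); // genuinely new

        Map<String, Datum> merged = merge(crawldb, seeds);
        System.out.println(merged.size());                            // 2
        System.out.println(merged.get("http://example.org/").status); // db_fetched
        System.out.println(merged.get("http://example.com/").status); // injected
    }
}
```

In a real driver one would register each input with `MultipleInputs.addInputPath(job, path, inputFormat, mapperClass)` so both mappers emit to this single reduce, which is what saves the extra sequential job.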

> I am not sure why the Injector was not written this way, which is more
> efficient than the one currently in trunk (maybe MultipleInputs was added
> to Hadoop later).

As far as I have discovered, joins have been available in Hadoop's mapred
package and subsequently in the mapreduce package, so it may not be a case of
them not being available... however, that goes no way towards explaining why
the Injector was not written this way.

> Wondering if I am wrong somewhere in my understanding. Any comments about
> this?

I am curious to discover how much more efficient using the MultipleInputs
class is over the sequential MR jobs as currently implemented. Do you have
any comparison on the size of the dataset being used?

There is a script [0] I keep on my GitHub which we can test this against
(1M URLs). That would provide a reasonable input dataset on which to base
some efficiency tests.

Great observations Tejas.
