nutch-dev mailing list archives

From Tejas Patil <>
Subject Re: Inject operation: can't it be done in a single map-reduce job ?
Date Mon, 06 Jan 2014 22:55:15 GMT
Thanks Lewis and Markus.

@Lewis: I don't have a dedicated cluster (I am currently neither a student
nor working anywhere), so I would be running in pseudo-distributed mode on
my laptop. I don't think that would be a good setup for gathering stats.
Does the ASF have any cluster that could be used?


On Mon, Jan 6, 2014 at 6:54 AM, Markus Jelsma <> wrote:

> Hi - Yes, MultipleInputs works very well; I did that too when coding the
> HostDB. The MultipleInputs class was not available when the injector was
> originally written; it was introduced around 0.19 or 0.20. I see no reason
> not to replace this, so +1 for a new ticket. If unit tests pass, we're good
> to go.
> -----Original message-----
> From: Lewis John Mcgibbney<>
> Sent: Monday 6th January 2014 15:40
> To:
> Subject: Re: Inject operation: can't it be done in a single map-reduce job
> ?
> Hi Tejas,
> On Sat, Jan 4, 2014 at 8:01 AM, <> wrote:
> I realized that by using MultipleInputs, we can read CrawlDatum objects
> from crawldb and urls from seeds file simultaneously and perform inject in
> a single map-reduce job. PFA which is an implementation of
> this approach. I did some basic testing on it and so far I have not
> encountered any problems.
> Dynamite Tejas. I would kindly ask that you open an issue and apply your
> patch against trunk :)
> I am not sure why Injector was not written this way which is more
> efficient than the one currently in trunk (maybe MultipleInputs was later
> added in Hadoop).
> As far as I have discovered, joins have been available in Hadoop's mapred
> package and subsequently in the mapreduce package, so it may not be a case
> of them not being available... however, this does not explain why
> the Injector was not written in this way.
> Wondering if I am wrong somewhere in my understanding. Any comments about
> this ?
> I am curious to discover how much more efficient using the MultipleInputs
> class is than the sequential MR jobs currently implemented. Do you
> have any comparison on the size of the dataset being used?
> There is a script [0] I keep on my GitHub which we can test this against
> (1M URLs). This would provide a reasonable input dataset which we could use
> to base some efficiency tests on.
> Great observations Tejas.
> Lewis
> [0] <>
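
For readers following the thread, here is a minimal plain-Java sketch of the single-pass idea discussed above. It is not the attached patch and it does not use Hadoop at all: it only simulates, in memory, two "map" sides (one over the seed list, one over the existing db) feeding one "reduce" step keyed by URL. The class name, the `INJECTED` marker, the string statuses, and the rule that an existing entry wins over a fresh seed are all assumptions made for illustration. In a real job, the two inputs would presumably be wired to their own mappers with Hadoop's `MultipleInputs.addInputPath`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SinglePassInjectSketch {

    // Hypothetical status marker for a freshly injected seed (not a Nutch constant).
    static final String INJECTED = "injected";

    // "Map" side for the seed list: emit (url, INJECTED) for each usable line.
    static List<String[]> mapSeeds(List<String> seedLines) {
        List<String[]> out = new ArrayList<>();
        for (String line : seedLines) {
            String url = line.trim();
            if (url.isEmpty() || url.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            out.add(new String[] {url, INJECTED});
        }
        return out;
    }

    // "Map" side for the existing db: pass every (url, status) entry through unchanged.
    static List<String[]> mapCrawlDb(Map<String, String> crawlDb) {
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, String> e : crawlDb.entrySet()) {
            out.add(new String[] {e.getKey(), e.getValue()});
        }
        return out;
    }

    // Single pass: both inputs feed one grouping step, then one "reduce" per URL.
    public static Map<String, String> inject(Map<String, String> crawlDb,
                                             List<String> seedLines) {
        List<String[]> pairs = new ArrayList<>(mapCrawlDb(crawlDb));
        pairs.addAll(mapSeeds(seedLines));

        // "Shuffle": group all emitted statuses by URL.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] p : pairs) {
            grouped.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
        }

        // "Reduce": assumed rule, an existing db entry wins over a fresh seed.
        Map<String, String> merged = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            String chosen = INJECTED;
            for (String status : e.getValue()) {
                if (!INJECTED.equals(status)) {
                    chosen = status;
                    break;
                }
            }
            merged.put(e.getKey(), chosen);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> db = new TreeMap<>();
        db.put("http://a.example/", "fetched");
        List<String> seeds = Arrays.asList("http://a.example/", "http://b.example/");
        System.out.println(inject(db, seeds));
    }
}
```

The point of the sketch is only that no intermediate output of seed entries has to be written and re-read by a second job: both sources are tagged on the map side and reconciled once per URL.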
