nutch-dev mailing list archives

From Tejas Patil <tejas.patil...@gmail.com>
Subject Re: Inject operation: can't it be done in a single map-reduce job ?
Date Mon, 06 Jan 2014 22:55:15 GMT
Thanks Lewis and Markus.

@Lewis: I don't have a dedicated cluster (I am currently neither a student nor
working anywhere), so I would be running in pseudo-distributed mode on my
laptop. I don't think that would be a good setup for gathering stats.
Does the ASF have a cluster that could be used?

Thanks,
Tejas


On Mon, Jan 6, 2014 at 6:54 AM, Markus Jelsma <markus.jelsma@openindex.io> wrote:

> Hi - Yes, MultipleInputs works very well; I did that too when coding the
> HostDB. The MultipleInputs class was not available when the injector was
> originally written; it was introduced around 0.19 or 0.20. I see no reason
> not to replace this, so +1 for a new ticket. If unit tests pass, we're good
> to go.
>
> -----Original message-----
> From: Lewis John Mcgibbney <lewis.mcgibbney@gmail.com>
> Sent: Monday 6th January 2014 15:40
> To: dev@nutch.apache.org
> Subject: Re: Inject operation: can't it be done in a single map-reduce job ?
>
> Hi Tejas,
>
> On Sat, Jan 4, 2014 at 8:01 AM, <dev-digest-help@nutch.apache.org> wrote:
>
> I realized that by using MultipleInputs, we can read CrawlDatum objects
> from the crawldb and URLs from the seeds file simultaneously and perform the
> inject in a single map-reduce job. Please find attached Injector2.java,
> which is an implementation of this approach. I did some basic testing on it
> and so far I have not encountered any problems.
>
> Dynamite Tejas. I would kindly ask that you open an issue and apply your
> patch against trunk :)
>
> I am not sure why Injector was not written this way, which is more
> efficient than the version currently in trunk (maybe MultipleInputs was
> added to Hadoop later).
>
> As far as I have discovered, joins have been available in Hadoop's mapred
> package and subsequently in the mapreduce package, so it may not be a case
> of them not being available... however, this does nothing to explain why
> the Injector was not written in this way.
>
> Wondering if I am wrong somewhere in my understanding. Any comments about
> this ?
>
> I am curious to discover how much more efficient using the MultipleInputs
> class is than the sequential MR jobs as currently implemented. Do you
> have any comparison for the size of the dataset being used?
>
> There is a script [0] I keep on my github which we can test this against
> (1M URLs). This would provide a reasonable input dataset which we could use
> to base some efficiency tests on.
>
> Great observations Tejas.
>
> Lewis
>
> [0] https://github.com/lewismc/nipt
>
>
>
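
For readers following the thread, the single-job idea quoted above can be
sketched roughly as below. This is not the attached Injector2.java and it
does not use Hadoop at all: it is a plain-Java simulation, assuming a
status string as a stand-in for CrawlDatum, assuming blank and "#" lines in
the seed list are skipped, and assuming an existing crawldb entry wins over
a freshly injected one. The class and method names (InjectSketch,
mapCrawlDb, mapSeeds, inject) are hypothetical. In a real job the two
inputs would presumably each be registered with
MultipleInputs.addInputPath(...) along with their own mapper class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a one-pass inject; no Hadoop classes involved.
class InjectSketch {
    // Hypothetical status given to URLs that come from the seed list.
    static final String INJECTED = "injected";

    // A (url, status) pair, as either "mapper" would emit it.
    static final class Tagged {
        final String url;
        final String status;
        Tagged(String url, String status) { this.url = url; this.status = status; }
    }

    // "Mapper" for the existing crawldb: passes entries through unchanged.
    static List<Tagged> mapCrawlDb(Map<String, String> crawlDb) {
        List<Tagged> out = new ArrayList<>();
        for (Map.Entry<String, String> e : crawlDb.entrySet()) {
            out.add(new Tagged(e.getKey(), e.getValue()));
        }
        return out;
    }

    // "Mapper" for the seed list: one URL per line, tagged as injected.
    // Skipping blank lines and '#' comments is an assumption of this sketch.
    static List<Tagged> mapSeeds(List<String> seedLines) {
        List<Tagged> out = new ArrayList<>();
        for (String line : seedLines) {
            String url = line.trim();
            if (url.isEmpty() || url.startsWith("#")) continue;
            out.add(new Tagged(url, INJECTED));
        }
        return out;
    }

    // Simulated shuffle and reduce: group both outputs by URL, then keep
    // the existing crawldb status if there is one, else the injected one.
    static Map<String, String> inject(Map<String, String> crawlDb, List<String> seedLines) {
        List<Tagged> all = new ArrayList<>(mapCrawlDb(crawlDb));
        all.addAll(mapSeeds(seedLines));

        Map<String, List<String>> grouped = new TreeMap<>();
        for (Tagged t : all) {
            grouped.computeIfAbsent(t.url, k -> new ArrayList<>()).add(t.status);
        }

        Map<String, String> merged = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            String chosen = INJECTED;
            for (String status : e.getValue()) {
                if (!INJECTED.equals(status)) { chosen = status; break; }
            }
            merged.put(e.getKey(), chosen);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> crawlDb = new TreeMap<>();
        crawlDb.put("http://a.example/", "fetched");
        List<String> seeds = Arrays.asList("http://a.example/", "http://b.example/", "# comment");
        System.out.println(inject(crawlDb, seeds));
    }
}
```

The point of the sketch is only that both inputs feed one grouping step, so
the merge with the existing crawldb happens in the same pass that reads the
seeds, rather than in a second job.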
