nutch-dev mailing list archives

From Lewis John McGibbney <lewi...@apache.org>
Subject Re: RE: [DISCUSS] Replacing MapReduce with Tez
Date Tue, 22 Dec 2020 03:59:26 GMT
Hi Markus,
Thanks for chiming in :)
My responses below

On 2020/12/21 21:32:08, Markus Jelsma <markus.jelsma@openindex.io> wrote: 
> Hello Lewis,
> 
> 1. counters, for me they are a requirement to have as they are key to regular inspections
> of ongoing crawls, finding errors and debugging. I hope you can find a work around.

I totally agree. Please see the observed issues I documented at https://cwiki.apache.org/confluence/display/NUTCH/Running+Nutch+on+Tez#RunningNutchonTez-ObservedIssues

> 
> 2. sounds interesting, but i'd like to see the test run with 12M rather than 12k URLs.

Please see https://cwiki.apache.org/confluence/display/NUTCH/Running+Nutch+on+Tez#RunningNutchonTez-RunningtheInjectorjobonTez

> 
> A question, are the produced files with Tez compatible with MapReduce programs, map and
> sequence files?

Having consulted with the Tez committers (https://s.apache.org/aiw8o), it appears that there
may be some less commonly used MapReduce features that Tez does not yet support. However, I have
yet to encounter any issues along those lines.

> It would be a tremendous advantage if existing programs can work with it. 

I agree... so far the results look promising.

> It would be a real pain to have to rewrite all code in one go. We have seen that lead
> to a dead end many times, including our 2.x-branch.

Yes, I'm intrigued to see how things progress, although I am still not 100% sure what code
rewriting would be required. I am still learning how our MapReduce jobs would
be written natively using the Tez DAG API.
