nutch-dev mailing list archives

From Lewis John McGibbney <>
Subject Re: [DISCUSS] Replacing MapReduce with Tez
Date Mon, 21 Dec 2020 20:40:02 GMT
Hi dev@,
Short update here. I've documented my initial observations running Nutch on Tez at
Specific early findings are as follows:
1. Counters don't appear to work, which makes sense as all existing counters are implemented
using the MapReduce framework. I'm not sure if Tez has a similar/equivalent concept of counters
but I am working to find out more.
2. So far, running some basic experiments using the Injector job on ~12k URLs, I've
observed the following:
- When '' is set to 'yarn-tez' I am observing the following runtimes
  * 1st run: elapsed: 00:00:42
  * 2nd run: elapsed: 00:00:13
  * 3rd run: elapsed: 00:00:14

- When '' is set to 'yarn' I am observing the following runtimes
  * 1st run: elapsed: 00:00:34
  * 2nd run: elapsed: 00:00:32
  * 3rd run: elapsed: 00:00:34

So after the first run (which was slower on Tez, 42s vs 34s), the Injector job on Tez finishes
in 13-14s compared with 32-34s under 'yarn', i.e. roughly a 2.4x runtime improvement.
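
For anyone who wants to redo the arithmetic on the runtimes listed above, here is a minimal Python sketch. The helper and variable names are purely illustrative, and treating the first run of each configuration as a warm-up is my assumption rather than something measured; three runs per setting is also a very small sample.

```python
# Elapsed times copied from the runs reported above (HH:MM:SS).
tez_runs = ["00:00:42", "00:00:13", "00:00:14"]    # framework set to 'yarn-tez'
yarn_runs = ["00:00:34", "00:00:32", "00:00:34"]   # framework set to 'yarn'


def to_seconds(hms):
    """Convert an HH:MM:SS string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Assumption: the first run is a warm-up, so it is excluded here.
warm_tez = mean([to_seconds(t) for t in tez_runs[1:]])
warm_yarn = mean([to_seconds(t) for t in yarn_runs[1:]])
speedup = warm_yarn / warm_tez

print(f"warm Tez mean:  {warm_tez:.1f}s")
print(f"warm yarn mean: {warm_yarn:.1f}s")
print(f"speedup:        {speedup:.2f}x")
```

Including the first run in both averages narrows the gap considerably, so the warm-up assumption matters when quoting a single number.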

As I mentioned in the Tez thread, I'm going to document all of this on the Nutch wiki. I also
plan to continue my evaluation over the holidays and will report back here when I have more.


On 2020/12/10 07:46:30, lewis john mcgibbney <> wrote: 
> Hi dev@,
> A while ago I had thought about bringing this topic up... I then got
> busy... for ages. I'll therefore get straight to the point.
> Has anyone on the dev@ team had any experience using Apache Tez -
> Tez promises multiple improvements over MapReduce. Naturally I wondered
> whether the Nutch project is at a stage of maturity now that we would look
> to leverage something more performant than legacy MapReduce.
> Were we to consider evolving Nutch by re-architecting it to use Tez as the
> processing engine, this would be a significant work effort.
> I just wanted to throw this out there for some blue-sky feedback.
> Thanks
> lewismc
> -- 
