Hello Lewis,
1. Counters: for me they are a requirement, as they are key to regularly inspecting
ongoing crawls, finding errors, and debugging. I hope you can find a workaround.
2. Sounds interesting, but I'd like to see the test run with 12M rather than 12k URLs.
A question: are the files produced with Tez (map files and sequence files) compatible with
existing MapReduce programs? It would be a tremendous advantage if existing programs could
work with them. It would be a real pain to have to rewrite all code in one go. We have seen
that lead to a dead end many times, including with our 2.x branch.
Have a nice evening!
Markus
-----Original message-----
> From: Lewis John McGibbney <lewismc@apache.org>
> Sent: Monday 21st December 2020 21:40
> To: dev@nutch.apache.org
> Subject: Re: [DISCUSS] Replacing MapReduce with Tez
>
> Hi dev@,
> Short update here. I've documented my initial observations running Nutch on Tez at https://s.apache.org/viee3
> Specific early findings are as follows:
> 1. Counters don't appear to work... which makes sense, as all existing counters are implemented
> using the MapReduce framework. I'm not sure if Tez has a similar/equivalent concept of counters,
> but I am working to find out more.
> 2. So far, running some basic experiments using the Injector job on around 12k URLs,
> I've observed the following:
> - When 'mapreduce.framework.name' is set to 'yarn-tez' I am observing the following runtimes
> * 1st run: elapsed: 00:00:42
> * 2nd run: elapsed: 00:00:13
> * 3rd run: elapsed: 00:00:14
>
> - When 'mapreduce.framework.name' is set to 'yarn' I am observing the following runtimes
> * 1st run: elapsed: 00:00:34
> * 2nd run: elapsed: 00:00:32
> * 3rd run: elapsed: 00:00:34
>
> So after the first run, it looks like running the Injector job on Tez results in a dramatic
> runtime improvement (13-14 seconds versus 32-34 seconds on plain YARN).
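To put a rough number on that, here is a minimal sketch that just does the arithmetic on the six elapsed times quoted above. The helper name to_seconds and the choice to treat the first run of each series as a warm-up are my own assumptions, not part of Lewis's test setup, and three runs on 12k URLs are far too few to read much into.

```python
from statistics import mean

def to_seconds(hms):
    """Convert an 'HH:MM:SS' elapsed string to whole seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

# Elapsed times as reported in the quoted message.
tez = [to_seconds(t) for t in ("00:00:42", "00:00:13", "00:00:14")]
mr = [to_seconds(t) for t in ("00:00:34", "00:00:32", "00:00:34")]

# Assumption: skip the first run of each series as a warm-up.
warm_tez = mean(tez[1:])
warm_mr = mean(mr[1:])
speedup = warm_mr / warm_tez

print(f"warm mean, yarn-tez: {warm_tez} s")
print(f"warm mean, yarn:     {warm_mr} s")
print(f"speedup (yarn / yarn-tez): {speedup:.2f}x")
```

With the first run included, the Tez average is much closer to the MapReduce one, which is another reason to repeat this with a larger input.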
>
> As I mentioned in the Tez thread, I'm going to document all of this on the Nutch wiki.
> I also plan to continue my evaluation over the holidays and will report back here when I
> have more information.
>
> Thanks
>
> On 2020/12/10 07:46:30, lewis john mcgibbney <lewismc@apache.org> wrote:
> > Hi dev@,
> > A while ago I had thought about bringing this topic up... I then got
> > busy... for ages. I'll therefore get straight to the point.
> > Has anyone on the dev@ team had any experience using Apache Tez -
> > tez.apache.org?
> > Tez promises multiple improvements over MapReduce. Naturally I wondered
> > whether the Nutch project is at a stage of maturity now that we would look
> > to leverage something more performant than legacy MapReduce.
> > Were we to consider evolving Nutch by re-architecting it to use Tez as the
> > processing engine, this would be a significant work effort.
> > I just wanted to throw this out there for some blue-sky feedback.
> > Thanks
> > lewismc
> >
> > --
> > http://home.apache.org/~lewismc/
> > http://people.apache.org/keys/committer/lewismc
> >
>