nutch-dev mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: [DISCUSS] Replacing MapReduce with Tez
Date Mon, 21 Dec 2020 21:32:08 GMT
Hello Lewis,

1. Counters: for me they are a requirement, as they are key to regular inspection
of ongoing crawls, finding errors, and debugging. I hope you can find a workaround.

2. Sounds interesting, but I'd like to see the test run with 12M rather than 12k URLs.

A question: are the files produced with Tez (map and sequence files) compatible with MapReduce
programs? It would be a tremendous advantage if existing programs could work with them. It would
be a real pain to have to rewrite all code in one go. We have seen that lead to a dead end
many times, including our 2.x branch.

Have a nice evening!
Markus

 
 
-----Original message-----
> From: Lewis John McGibbney <lewismc@apache.org>
> Sent: Monday 21st December 2020 21:40
> To: dev@nutch.apache.org
> Subject: Re: [DISCUSS] Replacing MapReduce with Tez
> 
> Hi dev@,
> Short update here. I've documented my initial observations running Nutch on Tez at https://s.apache.org/viee3
> Specific early findings are as follows:
> 1. Counters don't appear to work... which makes sense as all existing counters are manifested
> using the MapReduce framework. I'm not sure if Tez has a similar/equivalent concept of counters
> but I am working to find out more.
> 2. So far, running some basic experiments using the Injector job on ~12k URLs,
> I've observed the following:
> - When 'mapreduce.framework.name' is set to 'yarn-tez' I am observing the following runtimes
>   * 1st run: elapsed: 00:00:42
>   * 2nd run: elapsed: 00:00:13
>   * 3rd run: elapsed: 00:00:14
> 
> - When 'mapreduce.framework.name' is set to 'yarn' I am observing the following runtimes
>   * 1st run: elapsed: 00:00:34
>   * 2nd run: elapsed: 00:00:32
>   * 3rd run: elapsed: 00:00:34
> 
> So after the first run, it looks like running the Injector job on Tez results in a dramatic
> runtime improvement.
> 
> As I mentioned in the Tez thread, I'm going to document all of this on the Nutch wiki.
> I also plan to continue my evaluation over the holidays and will report back here when I
> have more information.
> 
> Thanks
> 
> On 2020/12/10 07:46:30, lewis john mcgibbney <lewismc@apache.org> wrote: 
> > Hi dev@,
> > A while ago I had thought about bringing this topic up... I then got
> > busy... for ages. I'll therefore get straight to the point.
> > Has anyone on the dev@ team had any experience using Apache Tez
> > (tez.apache.org)?
> > Tez promises multiple improvements over MapReduce. Naturally I wondered
> > whether the Nutch project is at a stage of maturity now that we would look
> > to leverage something more performant than legacy MapReduce.
> > Were we to consider evolving Nutch by re-architecting it to use Tez as the
> > processing engine, this would be a significant work effort.
> > I just wanted to throw this out there for some blue-sky feedback.
> > Thanks
> > lewismc
> > 
> > -- 
> > http://home.apache.org/~lewismc/
> > http://people.apache.org/keys/committer/lewismc
> > 
> 
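
For reference, the Injector timings quoted above can be compared with a few lines of arithmetic. The following is only a minimal sketch in plain Java (no Nutch, Hadoop, or Tez dependencies); the class name InjectorTimings and its helper methods are hypothetical and exist purely for illustration. It averages the reported elapsed times, both over all three runs and over the second and third runs only, on the assumption that the first run carries one-off start-up cost.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch or Tez.
public class InjectorTimings {

    // Parse an "HH:MM:SS" elapsed string into whole seconds.
    public static long toSeconds(String elapsed) {
        String[] parts = elapsed.split(":");
        return Duration.ofHours(Long.parseLong(parts[0]))
                .plusMinutes(Long.parseLong(parts[1]))
                .plusSeconds(Long.parseLong(parts[2]))
                .getSeconds();
    }

    // Arithmetic mean, in seconds, of a list of "HH:MM:SS" strings.
    public static double meanSeconds(List<String> runs) {
        return runs.stream()
                .mapToLong(InjectorTimings::toSeconds)
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Elapsed times exactly as reported in the thread.
        List<String> tez = Arrays.asList("00:00:42", "00:00:13", "00:00:14");
        List<String> yarn = Arrays.asList("00:00:34", "00:00:32", "00:00:34");

        // Means over all three runs.
        double tezAll = meanSeconds(tez);
        double yarnAll = meanSeconds(yarn);

        // Means over runs 2 and 3 only (assumption: run 1 includes start-up cost).
        double tezWarm = meanSeconds(tez.subList(1, tez.size()));
        double yarnWarm = meanSeconds(yarn.subList(1, yarn.size()));

        System.out.printf(Locale.ROOT, "all runs:  yarn %.1fs, yarn-tez %.1fs, ratio %.2f%n",
                yarnAll, tezAll, yarnAll / tezAll);
        System.out.printf(Locale.ROOT, "runs 2-3:  yarn %.1fs, yarn-tez %.1fs, ratio %.2f%n",
                yarnWarm, tezWarm, yarnWarm / tezWarm);
    }
}
```

On the numbers reported, runs 2 and 3 average 13.5 s with 'yarn-tez' against 33 s with 'yarn', roughly a 2.4x difference; including the first run narrows that to about 1.4x. With only three runs per setting on ~12k URLs, these figures are indicative at best.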
