nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2097) Proposal for Nutch 3.x
Date Mon, 14 Sep 2015 22:02:45 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744364#comment-14744364 ]

Lewis John McGibbney commented on NUTCH-2097:
---------------------------------------------

Hi Folks,
After being hooked up via [~chrismattmann], I've just spoken with [~ndouba] on Skype. This
is really exciting work so I asked him to please log a Jira issue as a parent issue (which
he has done) and we can begin thinking about a Nutch 3.X branch.
The core work undertaken by Nadeem so far can be summarized as follows:
 * Complete Ant + Ivy build system overhaul. (Not backward compatible)
 * Upgrade of all mapred --> mapreduce APIs in Nutch. (Not backward compatible)
 * Complete refactoring of all IO (custom NutchWritables) into a separate [IO package|https://github.com/allfro/nutch/tree/mr2-mvn/nutch-runtime/src/main/java/org/apache/nutch/io]
 * Complete refactoring of all Mapper functions into a separate [mapper package|https://github.com/allfro/nutch/tree/mr2-mvn/nutch-runtime/src/main/java/org/apache/nutch/mapper]
 * Complete refactoring of all Reducer functions into a separate [reducer package|https://github.com/allfro/nutch/tree/mr2-mvn/nutch-runtime/src/main/java/org/apache/nutch/reducer]
 * Introduction of a [lib package|https://github.com/allfro/nutch/tree/mr2-mvn/nutch-runtime/src/main/java/org/apache/nutch/lib],
which contains all input and output formats.
 * Upgrade of Hadoop dependencies from 2.4.0 --> 2.7.1

The package naming conventions above are, of course, intended to align with those of Apache
Hadoop.

My thoughts are as follows: The work in Nadeem's mr2-mvn branch is too wide-ranging and covers
too much of the Nutch 1.11-SNAPSHOT code base (as of commit r1697466, NUTCH-2049 Upgrade Trunk
to Hadoop > 2.4 stable) for us to backport it into Nutch trunk (1.11-SNAPSHOT).
Nadeem and I therefore discussed this and propose that we forward-port all commits made after
r1697466 to Nadeem's branch and put this codebase forward as Nutch 3.X, which will lessen
the burden on everyone. The burden here would be preparing a patch for each tool, each
issue, and each change. That would be hellish. Forward-porting as described above is the better
solution.

This issue should act as the parent for defining Nutch 3.X based on Nutch 1.11-SNAPSHOT.

> Proposal for Nutch 3.x
> ----------------------
>
>                 Key: NUTCH-2097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2097
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.12
>            Reporter: Nadeem Douba
>            Assignee: Lewis John McGibbney
>
> This is a parent issue which contains a proposal for Nutch 3.x. It's based on my branch
> (mr2-mvn at https://github.com/allfro/nutch).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
