nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
Date Thu, 25 Feb 2016 18:46:18 GMT


ASF GitHub Bot commented on NUTCH-1712:

GitHub user sebastian-nagel reopened a pull request:

    NUTCH-1712 Injector to use MultipleInputs (new MR API)

    Tested inject in combination with the other CrawlDb tools (readdb, updatedb, mergedb):
everything seems to work smoothly, although output files are named part-00000 or part-r-00000
(for the old and new MapReduce APIs, respectively).
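
The single-job approach relies on the new API's org.apache.hadoop.mapreduce.lib.input.MultipleInputs, which binds a separate InputFormat and Mapper to each input path of one job. A minimal job-setup sketch of that pattern follows; it assumes Hadoop on the classpath, and UrlMapper, CrawlDbMapper, and InjectReducer are hypothetical placeholder names, not the classes from this patch:

```java
// Sketch only: UrlMapper, CrawlDbMapper, and InjectReducer are
// illustrative names, not the actual classes in this pull request.
Job job = Job.getInstance(conf, "inject");

// Seed list: plain-text URLs, one per line.
MultipleInputs.addInputPath(job, seedDir,
    TextInputFormat.class, UrlMapper.class);

// Existing CrawlDb: SequenceFile of <Text url, CrawlDatum>.
MultipleInputs.addInputPath(job, new Path(crawlDb, "current"),
    SequenceFileInputFormat.class, CrawlDbMapper.class);

// A single reducer merges records from both inputs per URL.
job.setReducerClass(InjectReducer.class);
FileOutputFormat.setOutputPath(job, tempDir);
job.waitForCompletion(true);
```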

You can merge this pull request into a Git repository by running:

    $ git pull NUTCH-1712

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #86
commit 8900e4fb8b417f1f1e46f4dcb6c02840d2a5b838
Author: Sebastian Nagel <>
Date:   2015-10-19T19:48:05Z

    NUTCH-1712 applied to current trunk; run first simple tests (inject + merge)

commit 11942a92bd583eca8253e2b34f259f74c0ae4b81
Author: Sebastian Nagel <>
Date:   2016-01-17T20:32:31Z

    add unit tests based on MRUnit

commit 712b0b0ca2883fa399e23f7f22c9ffc236ec3db4
Author: Sebastian Nagel <>
Date:   2016-01-17T21:20:32Z

    update tests to reflect change of reduce outputs by new API (part-nnnnn -> part-r-nnnnn):
all unit tests pass now


> Use MultipleInputs in Injector to make it a single mapreduce job
> ----------------------------------------------------------------
>                 Key: NUTCH-1712
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.7
>            Reporter: Tejas Patil
>            Assignee: Sebastian Nagel
>         Attachments: NUTCH-1712-trunk.v1.patch
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the URLs from the seeds file and emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both the CrawlDb and the output of the sort job,
then merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from the CrawlDb and URLs from the seeds
file simultaneously and perform the inject in a single map-reduce job.
> Also, here are additional things covered by this JIRA:
> 1. Pushed filtering and normalization above metadata extraction so that unwanted
records are ruled out quickly.
> 2. Migrated to the new mapreduce API.
> 3. Improved documentation.
> 4. New JUnit tests with better coverage.
> Relevant discussion on nutch-dev can be found here:
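
The merge step described above can be simulated without Hadoop: per URL, the reducer may receive a freshly injected datum, an existing CrawlDb datum, or both, and an existing entry takes precedence. A minimal standalone sketch (the Datum record and status strings are illustrative stand-ins for Nutch's CrawlDatum, not its real representation):

```java
import java.util.List;

// Standalone simulation of the per-URL merge in the single-job inject.
// "Datum" and the status values are hypothetical stand-ins for CrawlDatum.
public class InjectMergeSketch {

    record Datum(String status, float score) {}

    // Reduce step: values for one URL may come from the seed-list mapper
    // ("injected") and/or the existing CrawlDb. An existing CrawlDb entry
    // wins over a freshly injected one.
    static Datum merge(List<Datum> values) {
        Datum injected = null, existing = null;
        for (Datum d : values) {
            if (d.status().equals("injected")) injected = d;
            else existing = d;
        }
        return existing != null ? existing : injected;
    }

    public static void main(String[] args) {
        // URL present in both the seed file and the CrawlDb:
        System.out.println(merge(List.of(new Datum("injected", 1.0f),
                                         new Datum("db_fetched", 1.5f))).status());
        // prints "db_fetched": the existing entry is kept

        // URL only in the seed file:
        System.out.println(merge(List.of(new Datum("injected", 1.0f))).status());
        // prints "injected": a new record is created
    }
}
```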

This message was sent by Atlassian JIRA
