nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
Date Tue, 18 Aug 2015 21:36:46 GMT


Lewis John McGibbney commented on NUTCH-1712:

[~tejasp] we are in the process of addressing NUTCH-2049. Would you be interested in rebasing
this patch against trunk so that we can work on getting it committed?

> Use MultipleInputs in Injector to make it a single mapreduce job
> ----------------------------------------------------------------
>                 Key: NUTCH-1712
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.7
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>             Fix For: 1.11
>         Attachments: NUTCH-1712-trunk.v1.patch
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the urls from seeds file, emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both crawldb and output of sort job. Merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls from seeds file simultaneously and perform inject in a single map-reduce job.
> Also, here are additional things covered with this jira:
> 1. Pushed filtering and normalization above metadata extraction so that the unwanted records are ruled out quickly.
> 2. Migrated to new mapreduce API
> 3. Improved documentation 
> 4. New junits with better coverage
> Relevant discussion over nutch-dev can be found here:
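
For illustration only, the single-job idea in the quoted description can be sketched in plain Java without Hadoop. This is not the attached patch and does not use the MultipleInputs API; it only simulates the data flow, assuming two input-specific "mappers" that feed one shared grouping step and a single "reducer" that merges per URL. The class name, the Datum type, the status strings, the trim-only normalization, the http/https filter, and the "existing crawldb entry wins" merge policy are all hypothetical simplifications.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SingleJobInjectSketch {

    /** Hypothetical, simplified stand-in for a CrawlDatum. */
    static final class Datum {
        final String status;
        final boolean injected; // true if it came from the seeds input
        Datum(String status, boolean injected) {
            this.status = status;
            this.injected = injected;
        }
    }

    /** "Mapper" for the seeds input: normalize and filter first, then emit. */
    static void mapSeed(String line, Map<String, List<Datum>> out) {
        String url = line.trim(); // assumed normalization: whitespace only
        // Assumed filter: drop anything that is not http(s) before doing more work.
        if (!(url.startsWith("http://") || url.startsWith("https://"))) {
            return;
        }
        out.computeIfAbsent(url, k -> new ArrayList<>()).add(new Datum("injected", true));
    }

    /** "Mapper" for the crawldb input: pass existing entries through unchanged. */
    static void mapCrawlDb(String url, String status, Map<String, List<Datum>> out) {
        out.computeIfAbsent(url, k -> new ArrayList<>()).add(new Datum(status, false));
    }

    /** "Reducer": one pass per URL; an existing entry wins over an injected one (assumed policy). */
    static Map<String, String> reduce(Map<String, List<Datum>> grouped) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, List<Datum>> e : grouped.entrySet()) {
            Datum chosen = null;
            for (Datum d : e.getValue()) {
                if (!d.injected) {
                    chosen = d; // existing crawldb entry found, keep it
                    break;
                }
                if (chosen == null) {
                    chosen = d;
                }
            }
            result.put(e.getKey(), chosen.injected ? "unfetched" : chosen.status);
        }
        return result;
    }

    /** Both inputs feed the same grouping step, so only one merge pass is needed. */
    static Map<String, String> inject(List<String> seeds, Map<String, String> crawlDb) {
        Map<String, List<Datum>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> e : crawlDb.entrySet()) {
            mapCrawlDb(e.getKey(), e.getValue(), grouped);
        }
        for (String line : seeds) {
            mapSeed(line, grouped);
        }
        return reduce(grouped);
    }

    public static void main(String[] args) {
        Map<String, String> crawlDb = new HashMap<>();
        crawlDb.put("http://a.example/", "fetched");
        List<String> seeds = Arrays.asList(" http://a.example/ ", "http://b.example/", "ftp://c.example/");
        System.out.println(inject(seeds, crawlDb));
    }
}
```

In a real Hadoop job the two mapper methods would presumably become separate Mapper classes, each bound to its own input path, with the grouping done by the framework's shuffle rather than an in-memory map.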

This message was sent by Atlassian JIRA
