nutch-dev mailing list archives

From "Diaa (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1772) Injector does not need merging if no pre-existing crawldb
Date Mon, 12 May 2014 22:30:17 GMT


Diaa commented on NUTCH-1772:

Great idea! This should improve performance.
IMHO it would help to add extra comments in the code for the reduction step, since it wasn't
obvious to me what it does until I read the description here. Something like:

 if (dbExists) //Postpone reducer to the merging step
 else //Reduce urls right away

Another suggestion is to add error handling so that the tempdir is cleaned up if the job
fails.
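
To make the cleanup suggestion concrete, here is a rough sketch in plain Java. It is not taken
from the Nutch code or the attached patch: the names TempDirCleanup, runWithCleanup and
deleteRecursively are made up for illustration, and it uses java.nio on a local directory
instead of the Hadoop FileSystem API the real Injector would go through.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.Callable;
import java.util.stream.Stream;

public class TempDirCleanup {

    /** Recursively deletes dir if it exists, children before parents. */
    public static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order so that files are deleted before their parent directory
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    /** Runs the job (hypothetical stand-in); if it throws, removes tempDir and rethrows. */
    public static <T> T runWithCleanup(Path tempDir, Callable<T> job) throws Exception {
        try {
            return job.call();
        } catch (Exception e) {
            deleteRecursively(tempDir);
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        Path tempDir = Files.createTempDirectory("inject-temp-");
        Files.createFile(tempDir.resolve("part-00000"));
        try {
            runWithCleanup(tempDir, () -> {
                throw new IOException("simulated job failure");
            });
        } catch (IOException e) {
            System.out.println("job failed, tempDir still exists: " + Files.exists(tempDir));
        }
    }
}
```

On success the tempdir is left alone, since the caller still needs its contents; only the
failure path removes it.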

> Injector does not need merging if no pre-existing crawldb
> ---------------------------------------------------------
>                 Key: NUTCH-1772
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>    Affects Versions: 1.8
>            Reporter: Julien Nioche
>         Attachments: NUTCH-1772.patch
> The injector currently works as follows:
> * MapReduce job 1 - Mapper: converts input lines into CrawlDatum objects with normalisation
> and filtering
> * MapReduce job 1 - Reducer: identity reducer. Can still have duplicates at this stage
> * MapReduce job 2 - Mapper: CrawlDbFilter on existing crawldb (if any) + output of
> previous job
> * MapReduce job 2 - Reducer: deduplication
> If there is no existing crawldb (which will often be the case at injection time), we don't
> really need the second MapReduce job and could simply take the output of MR job #1 as the
> CrawlDB, provided that we do the deduplication as part of its reduce step.
> If there is a crawldb, then the reduce step of MR job #1 is not really needed and
> we could make that job map-only.
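
The two cases described in the issue can be sketched in plain Java, without any Hadoop
classes. This is only an illustration of the control flow, not the patch itself: Entry,
dedupReduce and inject are hypothetical names, a null existingDb stands for "no crawldb",
and keeping the first entry seen per URL is an assumption made here for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InjectPlanSketch {

    /** Hypothetical stand-in for a (url, CrawlDatum) record. */
    public static final class Entry {
        public final String url;
        public final float score;

        public Entry(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    /** Deduplicating reduce: keeps one entry per URL (assumption: the first one seen). */
    public static List<Entry> dedupReduce(List<Entry> entries) {
        Map<String, Entry> byUrl = new LinkedHashMap<>();
        for (Entry e : entries) {
            byUrl.putIfAbsent(e.url, e);
        }
        return new ArrayList<>(byUrl.values());
    }

    /** mapped is the output of the job 1 mapper; existingDb is null if there is no crawldb. */
    public static List<Entry> inject(List<Entry> mapped, List<Entry> existingDb) {
        boolean dbExists = existingDb != null;
        if (dbExists) {
            // Postpone the reduce to the merge step: job 1 is map-only,
            // job 2 deduplicates the existing crawldb plus the new entries.
            List<Entry> merged = new ArrayList<>(existingDb);
            merged.addAll(mapped);
            return dedupReduce(merged);
        } else {
            // Reduce URLs right away: the output of job 1 becomes the crawldb.
            return dedupReduce(mapped);
        }
    }

    public static void main(String[] args) {
        List<Entry> mapped = Arrays.asList(
                new Entry("http://a.example/", 1.0f),
                new Entry("http://a.example/", 2.0f),
                new Entry("http://b.example/", 1.0f));
        List<Entry> existing = Arrays.asList(new Entry("http://b.example/", 5.0f));

        System.out.println("no crawldb:   " + inject(mapped, null).size() + " entries");
        System.out.println("with crawldb: " + inject(mapped, existing).size() + " entries");
    }
}
```

Either way the deduplication happens exactly once; the only difference is which job carries
the reduce step.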
