nutch-dev mailing list archives

From "Hudson (JIRA)" <>
Subject [jira] Commented: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer
Date Sat, 28 Nov 2009 14:09:21 GMT


Hudson commented on NUTCH-761:

Integrated in Nutch-trunk #995 (See [])
    Fix a bug resulting from over-eager optimization in .
 Avoid cloning CrawlDatum in CrawlDbReducer.

> Avoid cloning CrawlDatum in CrawlDbReducer
> ------------------------------------------
>                 Key: NUTCH-761
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Julien Nioche
>            Assignee: Andrzej Bialecki 
>            Priority: Minor
>             Fix For: 1.1
>         Attachments: optiCrawlReducer.patch
> In the vast majority of cases the CrawlDbReducer receives a single CrawlDatum per key in its
> reduce phase; these are the entries that come from the crawlDB and are not present in the segments.
> The attached patch optimizes the reduce step by avoiding an unnecessary cloning of the CrawlDatum
> fields when there is only one CrawlDatum in the values. The impact grows as the crawlDB
> gets larger; we noticed an improvement of around 25-30% in the time spent in the reduce phase.
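
The idea described above can be illustrated with a minimal sketch. This is not the code from optiCrawlReducer.patch: `Datum`, `copy()`, `copies` and `collect()` are hypothetical stand-ins invented for illustration, and the sketch uses a plain `java.util.Iterator` rather than the Hadoop reducer API. It only shows the general shape of the optimization: values are deep-copied when several arrive for one key, and the copy is skipped when there is exactly one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SingleValueReduceSketch {

    /** Hypothetical stand-in for CrawlDatum; NOT the Nutch class. */
    static final class Datum {
        static int copies = 0;      // counts deep copies, for illustration only

        final int status;
        final long fetchTime;

        Datum(int status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }

        /** Deep copy, standing in for the cloning the issue wants to avoid. */
        Datum copy() {
            copies++;
            return new Datum(status, fetchTime);
        }
    }

    /**
     * Collects the values for one key. When only one value is present it is
     * used directly; copies are made only when several values must be held
     * at the same time.
     */
    static List<Datum> collect(Iterator<Datum> values) {
        List<Datum> out = new ArrayList<>();
        if (!values.hasNext()) {
            return out;
        }
        Datum first = values.next();
        if (!values.hasNext()) {
            out.add(first);             // single value: no copy needed
            return out;
        }
        out.add(first.copy());          // several values: copy each one
        while (values.hasNext()) {
            out.add(values.next().copy());
        }
        return out;
    }

    public static void main(String[] args) {
        Datum.copies = 0;
        collect(Arrays.asList(new Datum(1, 10L)).iterator());
        System.out.println("copies for one value: " + Datum.copies);

        Datum.copies = 0;
        collect(Arrays.asList(new Datum(1, 10L), new Datum(2, 20L), new Datum(3, 30L)).iterator());
        System.out.println("copies for three values: " + Datum.copies);
    }
}
```

In a real reducer, whether the single value can safely be used without a copy depends on how the framework hands out value objects, so a shortcut like this would need to be checked against that behavior.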

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
