nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0
Date Thu, 12 Jun 2008 22:32:45 GMT


Andrzej Bialecki  commented on NUTCH-634:

The attached diff is not a valid patch created with 'svn diff'. Please create a patch using
'svn diff', from the top of the source tree of Nutch trunk/.

I'm not sure whether the FileOnlySequenceFileOutputFormat is the right answer to the problem
of _logs directories ... I think these directories exist because of a setting in the Hadoop
configuration, hadoop.job.history.user.location, which defaults to the output directory
(an awfully strange choice of default, to me!). Further investigation is needed
before we mess up things on our side. ;)

The code formatting in these two new files and in some other places doesn't conform to the
Nutch formatting, which is basically the Sun style with 2-space indents. Please note also
that your curly brace placement differs from what the Sun style advises.

Generics on the CrawlDbReducer are too general. Instead of

bq. implements Reducer<WritableComparable,Writable,WritableComparable,Writable>

it should be

bq. implements Reducer<Text, CrawlDatum, Text, CrawlDatum>

Similar tightening should be done in other places where you added generics.
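
To illustrate why the tighter type parameters matter, here is a minimal, self-contained sketch. It does not use the Hadoop API: SimpleReducer, LooseSum, TightSum and GenericsDemo are hypothetical stand-ins invented for this example, with String and Integer playing the roles of Text and CrawlDatum. The loosely typed version needs a cast on every value, while the tightly typed version lets the compiler check the types.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical, simplified stand-in for a reducer interface; not Hadoop's API. */
interface SimpleReducer<K, V, R> {
  R reduce(K key, Iterator<V> values);
}

/** Loosely typed: values arrive as Object, so every use needs a cast. */
class LooseSum implements SimpleReducer<Object, Object, Object> {
  public Object reduce(Object key, Iterator<Object> values) {
    int sum = 0;
    while (values.hasNext()) {
      sum += (Integer) values.next(); // a wrong type is only detected at run time
    }
    return sum;
  }
}

/** Tightly typed: the compiler checks key and value types, no casts needed. */
class TightSum implements SimpleReducer<String, Integer, Integer> {
  public Integer reduce(String key, Iterator<Integer> values) {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next();
    }
    return sum;
  }
}

class GenericsDemo {
  static Object loose() {
    return new LooseSum().reduce("k", Arrays.<Object>asList(1, 2, 3).iterator());
  }

  static int tight() {
    return new TightSum().reduce("k", Arrays.asList(1, 2, 3).iterator());
  }

  public static void main(String[] args) {
    System.out.println(loose());
    System.out.println(tight());
  }
}
```

Both variants compute the same sum here; the difference is that passing a non-Integer value to LooseSum compiles and then fails with a ClassCastException, whereas the same mistake with TightSum is rejected by the compiler.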

The CrawlDatum.shallowCopy() method is dangerous IMHO - newly created copies still contain
references to the same metaData instance, which may be modified at any time by the framework
as you iterate through the input items. We should do a deep clone using WritableUtils.clone().
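
The hazard can be reproduced without Hadoop. The following is only a sketch under stated assumptions: Datum is a hypothetical stand-in for CrawlDatum (not the Nutch class), deepCopy() stands in for a real deep clone, and the loop in collect() imitates a framework that refills one reused instance for every input item.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a CrawlDatum-like record; not the Nutch class. */
class Datum {
  int status;
  Map<String, String> metaData = new HashMap<>();

  /** Copies the fields but shares the metaData map with the original. */
  Datum shallowCopy() {
    Datum d = new Datum();
    d.status = status;
    d.metaData = metaData;
    return d;
  }

  /** Copies the fields and gives the copy its own metaData map. */
  Datum deepCopy() {
    Datum d = new Datum();
    d.status = status;
    d.metaData = new HashMap<>(metaData);
    return d;
  }
}

class CopyDemo {
  /** Imitates a framework that refills one reused instance per input item. */
  static List<String> collect(boolean deep) {
    Datum reused = new Datum();
    List<Datum> kept = new ArrayList<>();
    for (String url : new String[] {"a", "b"}) {
      reused.metaData.clear();
      reused.metaData.put("url", url);
      kept.add(deep ? reused.deepCopy() : reused.shallowCopy());
    }
    List<String> urls = new ArrayList<>();
    for (Datum d : kept) {
      urls.add(d.metaData.get("url"));
    }
    return urls;
  }

  public static void main(String[] args) {
    System.out.println(collect(false)); // prints [b, b]
    System.out.println(collect(true));  // prints [a, b]
  }
}
```

With shallow copies, every kept item ends up showing the metadata of the last input item, because they all share one map; the deep copies keep their own values.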

IndexDoc.copyConstructor() should be replaced by a deep clone().

> Patch - Nutch - Hadoop 0.17.0
> -----------------------------
>                 Key: NUTCH-634
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Michael Gottesman
>            Assignee: Andrzej Bialecki 
>             Fix For: 0.9.0
>         Attachments: diff, hadoop-0.17.patch
> This is a patch so that Nutch can be used with Hadoop 0.17.0. The patch is located at
> The patch compiles and passes all current Nutch unit tests.
> I have tested that the crawler side of Nutch (i.e. inject, generate, fetch, parse, merge
> w/crawldb) definitely works, but have not tested the Lucene indexing part. It might work, but
> it might not.
> *NOTE* - the two main bugs that had to be overcome were not noticed by any of the unit
> tests. The bugs only came up during actual testing. The bugs were:
> 1. Changes to the Hadoop Iterator
> 2. Addition of Serialization to MapReduce Framework

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
