nutch-dev mailing list archives

From Bin Wang <binwang...@gmail.com>
Subject How Map Reduce code in Nutch run in local mode vs distributed mode?
Date Fri, 03 Jan 2014 03:28:59 GMT
Hi there,

While going through the Nutch source code, I was looking at the ParseSegment class, which "parse[s] content in a segment". Here is its MapReduce job configuration (lines 199 - 213):
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup

199   JobConf job = new NutchJob(getConf());
200   job.setJobName("parse " + segment);
201
202   FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
203   job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
204   job.setInputFormat(SequenceFileInputFormat.class);
205   job.setMapperClass(ParseSegment.class);
206   job.setReducerClass(ParseSegment.class);
207
208   FileOutputFormat.setOutputPath(job, segment);
209   job.setOutputFormat(ParseOutputFormat.class);
210   job.setOutputKeyClass(Text.class);
211   job.setOutputValueClass(ParseImpl.class);
212
213   JobClient.runJob(job);
Here, on lines 202 and 208, the input and output paths are configured by calling
FileInputFormat.addInputPath and FileOutputFormat.setOutputPath.
And the segment looks like an absolute path on the local Linux filesystem rather than an HDFS path.
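My rough mental model (an assumption on my part, not something I have verified in the Hadoop source) is that a path string without an explicit scheme is resolved against the configured default filesystem, roughly like this toy sketch in plain Java (the namenode host below is made up):

```java
import java.net.URI;

public class DefaultFsResolution {
    // Toy approximation of how (I assume) Hadoop qualifies a scheme-less
    // path against the configured default filesystem (fs.default.name).
    static String qualify(String defaultFs, String path) {
        URI u = URI.create(path);
        if (u.getScheme() != null) {
            return path; // already fully qualified, e.g. hdfs://... or file://...
        }
        // Resolve the bare path against the default filesystem's URI
        return URI.create(defaultFs).resolve(path).toString();
    }

    public static void main(String[] args) {
        // Local mode: the default filesystem is the local one
        System.out.println(qualify("file:///", "/usr/joe/wordcount/input"));
        // Distributed mode: core-site.xml points at the NameNode
        System.out.println(qualify("hdfs://namenode:9000", "/usr/joe/wordcount/input"));
        // prints hdfs://namenode:9000/usr/joe/wordcount/input
    }
}
```

If that assumption holds, the same bare path would mean two different things depending on which mode the job runs in.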

On the other hand, when I look at the WordCount example on the Hadoop
homepage:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html (lines 39 - 55)

39.   JobConf conf = new JobConf(WordCount.class);
40.   conf.setJobName("wordcount");
41.
42.   conf.setOutputKeyClass(Text.class);
43.   conf.setOutputValueClass(IntWritable.class);
44.
45.   conf.setMapperClass(Map.class);
46.   conf.setCombinerClass(Reduce.class);
47.   conf.setReducerClass(Reduce.class);
48.
49.   conf.setInputFormat(TextInputFormat.class);
50.   conf.setOutputFormat(TextOutputFormat.class);
51.
52.   FileInputFormat.setInputPaths(conf, new Path(args[0]));
53.   FileOutputFormat.setOutputPath(conf, new Path(args[1]));
54.
55.   JobClient.runJob(conf);
Here, the input/output paths are configured the same way as in Nutch, but the
actual paths come in as command-line arguments:

bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount
/usr/joe/wordcount/input /usr/joe/wordcount/output

And the paths passed to the program are HDFS paths, not local Linux OS paths.
What confuses me is: is there some other configuration I have missed that
accounts for the difference in run environment? And in which case should I pass
an absolute local path versus an HDFS path?
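For what it's worth, the rule I am tentatively assuming (please correct me if I'm wrong) is: an explicit scheme on the path wins, and otherwise the scheme of the configured default filesystem applies. As a toy sketch (namenode host made up):

```java
import java.net.URI;

public class WhichFilesystem {
    // My working assumption, not verified against the Hadoop source:
    // an explicit scheme on the path argument wins; otherwise the
    // configured default filesystem's scheme applies.
    static String whichFs(String defaultFs, String pathArg) {
        String scheme = URI.create(pathArg).getScheme();
        return scheme != null ? scheme : URI.create(defaultFs).getScheme();
    }

    public static void main(String[] args) {
        System.out.println(whichFs("hdfs://namenode:9000", "/usr/joe/wordcount/input")); // hdfs
        System.out.println(whichFs("file:///", "/usr/joe/wordcount/input"));             // file
        System.out.println(whichFs("file:///", "hdfs://namenode:9000/data"));            // hdfs
    }
}
```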

Thanks a lot!

/usr/bin
