nutch-dev mailing list archives

From Tejas Patil <tejas.patil...@gmail.com>
Subject Re: How Map Reduce code in Nutch run in local mode vs distributed mode?
Date Fri, 03 Jan 2014 05:02:15 GMT
The config 'fs.default.name' in core-site.xml is what makes this happen.
Its default value is "file:///", which corresponds to Hadoop's local mode:
Hadoop resolves paths against the local file system. In distributed mode,
'fs.default.name' is set to something like "hdfs://IP_OF_NAMENODE/", and
the same paths are resolved in HDFS.
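For concreteness, here is a minimal sketch of the two core-site.xml
variants described above (the namenode host and port are placeholders,
not values from this thread):

```xml
<!-- Variant 1: local mode (this is also the built-in default). -->
<!-- Paths such as /usr/joe/wordcount/input resolve on the local disk. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
</configuration>

<!-- Variant 2: distributed mode (alternative contents for the same file). -->
<!-- The identical paths now resolve inside HDFS on that namenode. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000/</value>
  </property>
</configuration>
```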

Thanks,
Tejas


On Thu, Jan 2, 2014 at 7:28 PM, Bin Wang <binwang.cu@gmail.com> wrote:

> Hi there,
>
> I was going through the source code of Nutch's ParseSegment class,
> which is the class that "parse[s] content in a segment". Here is its
> map-reduce job configuration part.
>
> http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup
> (Line 199 - 213)
>
> 199 JobConf job = new NutchJob(getConf());
> 200 job.setJobName("parse " + segment);
> 201
> 202 FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
> 203 job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
> 204 job.setInputFormat(SequenceFileInputFormat.class);
> 205 job.setMapperClass(ParseSegment.class);
> 206 job.setReducerClass(ParseSegment.class);
> 207
> 208 FileOutputFormat.setOutputPath(job, segment);
> 209 job.setOutputFormat(ParseOutputFormat.class);
> 210 job.setOutputKeyClass(Text.class);
> 211 job.setOutputValueClass(ParseImpl.class);
> 212
> 213 JobClient.runJob(job);
> Here, on lines 202 and 208, the job's input and output paths are
> configured by calling FileInputFormat.addInputPath and
> FileOutputFormat.setOutputPath.
> And the path is an absolute path on the Linux file system, not an HDFS path.
>
> On the other hand, here is the WordCount example from the Hadoop
> MapReduce tutorial:
> https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html (Line 39 - 55)
>
> 39. JobConf conf = new JobConf(WordCount.class);
> 40. conf.setJobName("wordcount");
> 41.
> 42. conf.setOutputKeyClass(Text.class);
> 43. conf.setOutputValueClass(IntWritable.class);
> 44.
> 45. conf.setMapperClass(Map.class);
> 46. conf.setCombinerClass(Reduce.class);
> 47. conf.setReducerClass(Reduce.class);
> 48.
> 49. conf.setInputFormat(TextInputFormat.class);
> 50. conf.setOutputFormat(TextOutputFormat.class);
> 51.
> 52. FileInputFormat.setInputPaths(conf, new Path(args[0]));
> 53. FileOutputFormat.setOutputPath(conf, new Path(args[1]));
> 54.
> 55. JobClient.runJob(conf);
> Here, the input/output paths are configured the same way as in Nutch,
> but the actual paths are passed in as command-line arguments:
> bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount
> /usr/joe/wordcount/input /usr/joe/wordcount/output
> And we can see the paths passed to the program are HDFS paths,
> not local Linux paths.
> What confuses me is: is there some other configuration I missed that
> leads to the difference in run environment? And in which case should I
> pass an absolute local path versus an HDFS path?
>
> Thanks a lot!
>
> /usr/bin
>
>
