nutch-dev mailing list archives

From Tejas Patil <tejas.patil...@gmail.com>
Subject Re: How Map Reduce code in Nutch run in local mode vs distributed mode?
Date Fri, 03 Jan 2014 18:52:09 GMT
In local mode, the hadoop jars in the classpath (see
runtime/local/lib/hadoop-core-1.2.0.jar) are used by nutch jobs. Hadoop's
FileSystem class (see line 132 in [0]) then picks up the default value of
'fs.default.name'.

[0] :
http://svn.apache.org/viewvc/hadoop/common/branches/branch-1.2/src/core/org/apache/hadoop/fs/FileSystem.java?view=markup
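The fallback described above can be sketched as a plain key-with-default lookup. This is a simplified, hypothetical analogue of what Hadoop's Configuration.get(name, defaultValue) does when FileSystem asks for the default filesystem; the class name and sample values below are illustrative, not Hadoop code:

```java
import java.util.Properties;

public class FsDefaultSketch {
    // Mimics Configuration.get("fs.default.name", "file:///"):
    // if a core-site.xml entry is present, use it; otherwise fall
    // back to the local filesystem URI.
    static String defaultFs(Properties conf) {
        return conf.getProperty("fs.default.name", "file:///");
    }

    public static void main(String[] args) {
        Properties local = new Properties();  // no core-site.xml at all
        Properties distributed = new Properties();
        distributed.setProperty("fs.default.name", "hdfs://localhost:9000");

        System.out.println(defaultFs(local));        // file:///
        System.out.println(defaultFs(distributed));  // hdfs://localhost:9000
    }
}
```

This is why a Nutch install with no Hadoop configuration at all still runs: the missing key simply resolves to "file:///".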

On Fri, Jan 3, 2014 at 10:22 AM, Bin Wang <binwang.cu@gmail.com> wrote:

> Hi Tejas,
>
> Thanks a lot for your response. Now I completely understand how the WordCount
> example reads its path as an HDFS path: the `hadoop` command is used to run
> WordCount.jar, and the `hadoop` configuration says:
> <configuration>
>      <property>
>          <name>fs.default.name</name>
>          <value>hdfs://localhost:9000</value>
>      </property>
> </configuration>
> ...
>
> However, Nutch 1.7 can be installed without Hadoop preinstalled. Where
> does Nutch read the filesystem configuration from? There is no core-site.xml
> for Nutch, is there? So does it default to local mode?
>
> /usr/bin
>
>
>
>
> On Thu, Jan 2, 2014 at 10:02 PM, Tejas Patil <tejas.patil.cs@gmail.com> wrote:
>
>> The config 'fs.default.name' of core-site.xml is what makes this happen.
>> Its default value is "file:///" which corresponds to local mode of Hadoop.
>> In local mode Hadoop looks for paths on the local file system. In
>> distributed mode of Hadoop, 'fs.default.name' would be
>> "hdfs://IP_OF_NAMENODE/" and it will look for those paths in HDFS.
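The scheme-based behavior described above can be sketched with plain java.net.URI resolution: the same relative job path lands on different filesystems depending on what 'fs.default.name' says. The namenode address and segment path below are hypothetical:

```java
import java.net.URI;

public class PathResolution {
    // Resolve a relative job path against the filesystem URI named
    // by fs.default.name, as Hadoop conceptually does for Paths
    // that carry no scheme of their own.
    static String resolve(String fsUri, String relPath) {
        return URI.create(fsUri).resolve(relPath).toString();
    }

    public static void main(String[] args) {
        String segment = "crawl/segments/20140103/content";
        System.out.println(resolve("file:///", segment));
        System.out.println(resolve("hdfs://namenode:9000/", segment));
    }
}
```

With "file:///" the path resolves to a local file, with "hdfs://namenode:9000/" the identical string resolves to a path in HDFS.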
>>
>> Thanks,
>> Tejas
>>
>>
>> On Thu, Jan 2, 2014 at 7:28 PM, Bin Wang <binwang.cu@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> While going through the source code of Nutch, I looked at the ParseSegment
>>> class, which is the class that "parse[s] content in a segment". Here is its
>>> map reduce job configuration part:
>>>
>>> http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup
>>> (Line 199 - 213)
>>>
>>> 199 JobConf job = new NutchJob(getConf());
>>> 200 job.setJobName("parse " + segment);
>>> 201
>>> 202 FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
>>> 203 job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
>>> 204 job.setInputFormat(SequenceFileInputFormat.class);
>>> 205 job.setMapperClass(ParseSegment.class);
>>> 206 job.setReducerClass(ParseSegment.class);
>>> 207
>>> 208 FileOutputFormat.setOutputPath(job, segment);
>>> 209 job.setOutputFormat(ParseOutputFormat.class);
>>> 210 job.setOutputKeyClass(Text.class);
>>> 211 job.setOutputValueClass(ParseImpl.class);
>>> 212
>>> 213 JobClient.runJob(job);
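As a side note on line 202 above: Hadoop's Path(parent, child) constructor simply joins the segment directory with Content.DIR_NAME (assumed here to be "content"). A plain-JDK analogue, with a hypothetical segment directory:

```java
import java.nio.file.Paths;

public class SegmentPathSketch {
    public static void main(String[] args) {
        String segment = "crawl/segments/20140103";  // hypothetical segment dir
        // new Path(segment, Content.DIR_NAME) joins parent and child,
        // much like java.nio.file.Paths.get does for local paths:
        System.out.println(Paths.get(segment, "content"));
    }
}
```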
>>>
>>> Here, in line 202 and line 208, the map reduce input/output paths are
>>> configured by calling addInputPath/setOutputPath from
>>> FileInputFormat/FileOutputFormat.
>>> And the segment path is an absolute path in the Linux OS instead of an
>>> HDFS virtual path.
>>>
>>> On the other hand, when I look at the WordCount example on the hadoop
>>> homepage:
>>> https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html (Line 39 - 55)
>>>
>>> 39. JobConf conf = new JobConf(WordCount.class);
>>> 40. conf.setJobName("wordcount");
>>> 41.
>>> 42. conf.setOutputKeyClass(Text.class);
>>> 43. conf.setOutputValueClass(IntWritable.class);
>>> 44.
>>> 45. conf.setMapperClass(Map.class);
>>> 46. conf.setCombinerClass(Reduce.class);
>>> 47. conf.setReducerClass(Reduce.class);
>>> 48.
>>> 49. conf.setInputFormat(TextInputFormat.class);
>>> 50. conf.setOutputFormat(TextOutputFormat.class);
>>> 51.
>>> 52. FileInputFormat.setInputPaths(conf, new Path(args[0]));
>>> 53. FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>>> 54.
>>> 55. JobClient.runJob(conf);
>>> Here, the input/output paths are configured in the same way as in Nutch,
>>> but the paths are actually passed in as command-line arguments:
>>>
>>> bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount
>>> /usr/joe/wordcount/input /usr/joe/wordcount/output
>>>
>>> And we can see the paths passed to the program are actually HDFS paths,
>>> not Linux OS paths.
>>> What confuses me: is there some other configuration that I missed which
>>> leads to the difference in run environment? And in which case should I
>>> pass an absolute Linux path versus an HDFS path?
>>>
>>> Thanks a lot!
>>>
>>> /usr/bin
>>>
>>>
>>
>
