nutch-dev mailing list archives

From Tejas Patil <tejas.patil...@gmail.com>
Subject Re: Independent Map Reduce to parse Nutch content (Cont.)
Date Sun, 05 Jan 2014 03:56:26 GMT
>> It finishes all the mappers without problem but still errors out after
>> all the mappers:
>> Exception in thread "main" java.io.IOException: Job failed!

As I mentioned in my earlier mail, did you check the logs to find out the
root cause of the exception?

>> I can see that Nutch constantly uses the Hadoop API without Hadoop
>> pre-installed... so why can't my code work?

The way you are running it in local mode is:
>> java -jar example.jar localinput/ localoutput/

which does not add the Hadoop jars and their dependencies to the classpath.
You could set Hadoop's configuration for local mode and then run it using
the same command you used for the distributed mode, i.e.
>> hadoop jar example.jar hdfsinput/ hdfsoutput/

The advantage of using this command is that it sets up the classpath and
environment variables for you and then invokes the relevant Java class. When
your config is tuned for Hadoop's local mode, it will run locally just
like Nutch's local mode does.
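
For example, here is a minimal sketch of forcing local mode programmatically
(the property names below are the Hadoop 1.x ones matching the old
org.apache.hadoop.mapred API in your stack trace; treat them as an assumption
and adjust for your Hadoop version):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.nutch.util.NutchConfiguration;

    // in main(): force local mode before handing the config to ToolRunner
    Configuration conf = NutchConfiguration.create();
    conf.set("mapred.job.tracker", "local");  // run MapReduce in-process (LocalJobRunner)
    conf.set("fs.default.name", "file:///");  // use the local filesystem instead of HDFS
    int res = ToolRunner.run(conf, new ParseMapred(), args);
    System.exit(res);

Equivalently, the same two properties can go into mapred-site.xml and
core-site.xml under your Hadoop conf directory, which is what the hadoop
command reads.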

Thanks,
Tejas


On Sat, Jan 4, 2014 at 12:47 PM, Bin Wang <binwang.cu@gmail.com> wrote:

> Hi Tejas,
>
> I started an AWS instance and ran Hadoop in single-node mode.
>
> When I do..
> hadoop jar example.jar hdfsinput/ hdfsoutput/
>
> Everything worked perfectly, as I expected: a bunch of stuff got printed to
> the screen and both the mappers and the reducers finished without issue. In
> the end, the expected output sits in the HDFS output directory.
>
> However, when I tried to run the jar file without hadoop:
> java -jar example.jar localinput/ localoutput/
> It finishes all the mappers without problem but still errors out after all
> the mappers:
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:784)
> at arrow.ParseMapred.run(ParseMapred.java:70)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at arrow.ParseMapred.main(ParseMapred.java:18)
>
> I am so confused now about why my code doesn't work locally...
> Based on my understanding, I can see that Nutch constantly uses the Hadoop
> API without Hadoop pre-installed... so why can't my code work?
>
> Well, any hint or directional guidance will be appreciated, many thanks!
>
> /usr/bin
>
>
>
>
On Sat, Jan 4, 2014 at 12:38 AM, Tejas Patil <tejas.patil.cs@gmail.com> wrote:
>
>> Hi Bin Wang,
>> I would suggest that you NOT use Eclipse and instead run your code from
>> the command line. Use logger statements and check the logs for the full
>> stack traces of the failure. In my experience, logs are a much better way
>> to debug Hadoop code than the Eclipse debugger.
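>>
>> For example, a minimal sketch with the slf4j API (an assumption on my
>> part: recent Nutch releases log through slf4j, so it should already be on
>> your classpath when you build against Nutch; swap in your own logger
>> otherwise):
>>
>>   import org.slf4j.Logger;
>>   import org.slf4j.LoggerFactory;
>>
>>   // add a logger field to ParseMapred and use it instead of System.out
>>   private static final Logger LOG = LoggerFactory.getLogger(ParseMapred.class);
>>
>>   // e.g. inside map():
>>   LOG.info("Key: {} Result: {}", url, result);
>>
>> That way the output ends up in the logs (with timestamps and full stack
>> traces) instead of only on the Eclipse console.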
>>
>> Thanks,
>> Tejas
>>
>>
>> On Fri, Jan 3, 2014 at 8:56 PM, Bin Wang <binwang.cu@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I tried to modify the code here to parse the Nutch content data:
>>>
>>> http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup
>>> At the end of this email is a prototype I have written that runs a
>>> MapReduce job to calculate the HTML content length of each URL that I
>>> have scraped.
>>>
>>> The mapper part runs perfectly fine as expected; however, the whole
>>> program stops after all the mappers finish and the reducer never gets a
>>> chance to run. (A certain number of pages were scraped, and the Eclipse
>>> console shows the same number of mapper runs, so I assume all the
>>> mappers finished.)
>>>
>>> Could anyone who is experienced with writing Java MapReduce jobs take a
>>> look at my code and see what the error might be? I am not a Java
>>> developer at all, so any debugging trick or common-sense advice will be
>>> appreciated!
>>> (I heard that it is fairly hard to debug code written using the Hadoop
>>> API... is that true?)
>>>
>>> Many thanks!
>>>
>>> /usr/bin
>>>
>>> _____________________________________________________
>>>
>>> Eclipse Console Info
>>>
>>> Starting Mapper ...
>>> Key: http://url1
>>> Result: 134943
>>>
>>> Starting Mapper ...
>>> Key: http://url2
>>> Result: 258588
>>>
>>> Exception in thread "main" java.io.IOException: Job failed!
>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:784)
>>>     at arrow.ParseMapred.run(ParseMapred.java:68)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>     at arrow.ParseMapred.main(ParseMapred.java:18)
>>>
>>> _____________________________________________________
>>>
>>> // my code
>>>
>>> package example;
>>>
>>> import ...;
>>>
>>> public class ParseMapred extends Configured implements Tool,
>>>     Mapper<WritableComparable<?>, Content, Text, IntWritable>,
>>>     Reducer<Text, IntWritable, Text, IntWritable> {
>>>
>>>   public static void main(String[] args) throws Exception {
>>>     int res = ToolRunner.run(NutchConfiguration.create(),
>>>         new ParseMapred(), args);
>>>     System.exit(res);
>>>   }
>>>
>>>   public void configure(JobConf job) {
>>>     setConf(job);
>>>   }
>>>
>>>   public void close() throws IOException {}
>>>
>>>   public void reduce(Text key, Iterator<IntWritable> values,
>>>       OutputCollector<Text, IntWritable> output, Reporter reporter)
>>>       throws IOException {
>>>     System.out.println("Starting Reducer ...");
>>>     System.out.println("Reducer: " + "key" + key);
>>>     output.collect(key, values.next()); // collect first value
>>>   }
>>>
>>>   public void map(WritableComparable<?> key, Content content,
>>>       OutputCollector<Text, IntWritable> output, Reporter reporter)
>>>       throws IOException {
>>>     Text url = new Text();
>>>     IntWritable result = new IntWritable();
>>>     url.set("fail");
>>>     result = new IntWritable(1);
>>>     try {
>>>       System.out.println("Starting Mapper ...");
>>>       url.set(key.toString());
>>>       result = new IntWritable(content.getContent().length);
>>>       System.out.println("Key: " + url);
>>>       System.out.println("Result: " + result);
>>>       output.collect(url, result);
>>>     } catch (Exception e) {
>>>       // TODO Auto-generated catch block
>>>       output.collect(url, result);
>>>     }
>>>   }
>>>
>>>   public int run(String[] args) throws Exception {
>>>     JobConf job = new NutchJob(getConf());
>>>     job.setJobName("ParseData");
>>>     FileInputFormat.addInputPath(job, new Path("/Users/.../data/"));
>>>     FileOutputFormat.setOutputPath(job, new Path("/Users/.../result"));
>>>     job.setInputFormat(SequenceFileInputFormat.class);
>>>     job.setOutputFormat(TextOutputFormat.class);
>>>     job.setOutputKeyClass(Text.class);
>>>     job.setOutputValueClass(IntWritable.class);
>>>     job.setMapperClass(ParseMapred.class);
>>>     job.setReducerClass(ParseMapred.class);
>>>     JobClient.runJob(job);
>>>     return 0;
>>>   }
>>> }
>>>
>>
>>
>
