nutch-dev mailing list archives

From Bin Wang <binwang...@gmail.com>
Subject Independent Map Reduce to parse Nutch content (Cont.)
Date Sat, 04 Jan 2014 04:56:07 GMT
Hi,

I tried to modify the code here to parse the Nutch content data:
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup
At the end of this email is a prototype I wrote that runs a MapReduce job
to calculate the HTML content length of each URL that I have scraped.

The mapper part runs as expected. However, the whole program stops after
all the mappers have finished, and the reducer never gets a chance to run.
(I know how many pages were scraped, and the Eclipse console shows the same
number of "Starting Mapper ..." lines, so I assume all the mappers
finished.)

Could anyone who is experienced in writing Java MapReduce jobs take a look
at my code and see what the error might be? I am not a Java developer at
all, so any debugging trick or common-sense advice would be appreciated!
(I have heard that it is fairly hard to debug code written with the Hadoop
API. Is that true?)

Many thanks!

/usr/bin

_____________________________________________________

Eclipse Console Info

Starting Mapper ...
Key: http://url1
Result: 134943
Starting Mapper ...
Key: http://url2
Result: 258588
Exception in thread "main" java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:784)
	at arrow.ParseMapred.run(ParseMapred.java:68)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at arrow.ParseMapred.main(ParseMapred.java:18)
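
The trace above only reports the generic "Job failed!" IOException raised from JobClient.runJob; it does not say which task failed or why. One debugging trick that may help is to make task code keep the exception detail instead of discarding it, which is what the catch block in the mapper further down does. The following is a plain-Java sketch with no Hadoop dependency; FailureDetail, describe and lengthOrFail are hypothetical names used only for illustration, and whether this surfaces the actual cause of this particular failure is an assumption.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper class, not part of Hadoop or Nutch.
public class FailureDetail {

    /** Builds one string holding the key plus the exception's type, message and stack trace. */
    public static String describe(String key, Exception e) {
        StringWriter buffer = new StringWriter();
        e.printStackTrace(new PrintWriter(buffer, true));
        return "Failed on key " + key + ": " + buffer;
    }

    /** Returns the content length, or logs the detail and rethrows with the key attached. */
    public static int lengthOrFail(String key, byte[] content) {
        try {
            return content.length; // throws NullPointerException when content is null
        } catch (Exception e) {
            System.err.println(describe(key, e));
            throw new IllegalStateException("Cannot measure content for " + key, e);
        }
    }

    public static void main(String[] args) {
        // Sample data only: an array sized like the first result in the console output.
        System.out.println(lengthOrFail("http://url1", new byte[134943]));
        try {
            lengthOrFail("http://url2", null);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage() + " <- " + e.getCause());
        }
    }
}
```

The design choice here is to fail loudly: the original exception travels along as the cause, so whatever eventually reports the failure has the key and the stack trace available.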

_____________________________________________________

// my code

package example;

import ...;

public class ParseMapred extends Configured implements Tool,
    Mapper<WritableComparable<?>, Content, Text, IntWritable>,
    Reducer<Text, IntWritable, Text, IntWritable> {

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(NutchConfiguration.create(),
        new ParseMapred(), args);
    System.exit(res);
  }

  public void configure(JobConf job) {
    setConf(job);
  }

  public void close() throws IOException {}

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    System.out.println("Starting Reducer ...");
    System.out.println("Reducer: " + "key" + key);
    output.collect(key, values.next()); // collect first value
  }

  public void map(WritableComparable<?> key, Content content,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    Text url = new Text();
    IntWritable result = new IntWritable();
    url.set("fail");
    result = new IntWritable(1);
    try {
      System.out.println("Starting Mapper ...");
      url.set(key.toString());
      result = new IntWritable(content.getContent().length);
      System.out.println("Key: " + url);
      System.out.println("Result: " + result);
      output.collect(url, result);
    } catch (Exception e) {
      // TODO Auto-generated catch block
      output.collect(url, result);
    }
  }

  public int run(String[] args) throws Exception {
    JobConf job = new NutchJob(getConf());
    job.setJobName("ParseData");
    FileInputFormat.addInputPath(job, new Path("/Users/.../data/"));
    FileOutputFormat.setOutputPath(job, new Path("/Users/.../result"));
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setOutputFormat(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(ParseMapred.class);
    job.setReducerClass(ParseMapred.class);
    JobClient.runJob(job);
    return 0;
  }
}
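
Since the map and reduce logic in the class above is small, one way to check it without the Hadoop runtime is to replay the same steps in plain Java: map each URL to its content length, group the values by key, then keep the first value per key as reduce() does. This is only a sketch under that assumption. LocalMapReduceSketch and its methods are hypothetical names, the input is made-up sample data sized to match the console output, and nothing here uses the Hadoop or Nutch API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the job's data flow; it does not call Hadoop.
public class LocalMapReduceSketch {

    /** Map step: emit one (url, content length) pair per input record. */
    public static List<Map.Entry<String, Integer>> map(Map<String, byte[]> pages) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (Map.Entry<String, byte[]> page : pages.entrySet()) {
            emitted.add(new AbstractMap.SimpleEntry<>(page.getKey(), page.getValue().length));
        }
        return emitted;
    }

    /** Shuffle step: group the emitted values by key, with keys in sorted order. */
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> emitted) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : emitted) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce step: keep the first value of each group, like the reducer in the post. */
    public static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            Iterator<Integer> values = group.getValue().iterator();
            result.put(group.getKey(), values.next());
        }
        return result;
    }

    /** Runs the three steps in order. */
    public static Map<String, Integer> run(Map<String, byte[]> pages) {
        return reduce(shuffle(map(pages)));
    }

    public static void main(String[] args) {
        Map<String, byte[]> pages = new LinkedHashMap<>();
        pages.put("http://url1", new byte[134943]);
        pages.put("http://url2", new byte[258588]);
        System.out.println(run(pages)); // prints {http://url1=134943, http://url2=258588}
    }
}
```

If this plain version produces the expected lengths, the per-record logic is probably fine and the failure more likely lies in the job setup or the runtime rather than in the map and reduce bodies.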
