nutch-dev mailing list archives

From "Nadeem Douba (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (NUTCH-1084) ReadDB url throws exception
Date Sat, 12 Sep 2015 07:16:46 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741933#comment-14741933 ]

Nadeem Douba edited comment on NUTCH-1084 at 9/12/15 7:15 AM:
--------------------------------------------------------------

I think I found the issue, and I don't think it's related to Nutch. AbstractMapWritable uses
the Class.forName method, which throws the ClassNotFoundException (CNFE). This is because
Class.forName resolves classes through the class loader of the calling class (here the system
class loader), which differs from the current thread's context class loader in that its class
path does not include the job jar. I recompiled hadoop-common with the Class.forName call
replaced by Thread.currentThread().getContextClassLoader().loadClass(className), and this
seems to fix the issue. The bug report can be found here: https://issues.apache.org/jira/browse/HADOOP-12406
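
To illustrate the difference described above, here is a minimal sketch in plain Java. It is not the Hadoop patch itself: ContextLoaderSketch, RecordingLoader and loadViaContextLoader are hypothetical names made up for this example, and RecordingLoader only stands in for a loader that can see the job jar. It records which class names it is asked for, so you can see that Class.forName(String) never consults the thread's context class loader while an explicit lookup does.

```java
import java.util.ArrayList;
import java.util.List;

public class ContextLoaderSketch {

    /** Hypothetical stand-in for a job-jar loader: records each lookup, then delegates to its parent. */
    static class RecordingLoader extends ClassLoader {
        final List<String> requested = new ArrayList<>();

        RecordingLoader(ClassLoader parent) {
            super(parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            requested.add(name);
            return super.loadClass(name, resolve);
        }
    }

    /** Resolves a class through the current thread's context class loader. */
    static Class<?> loadViaContextLoader(String className) throws ClassNotFoundException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // No context loader set: fall back to the loader of this class.
            loader = ContextLoaderSketch.class.getClassLoader();
        }
        return loader.loadClass(className);
    }

    public static void main(String[] args) throws Exception {
        Thread thread = Thread.currentThread();
        ClassLoader original = thread.getContextClassLoader();
        RecordingLoader jobLoader = new RecordingLoader(original);
        thread.setContextClassLoader(jobLoader);
        try {
            // Class.forName(String) goes through the caller's own loader, not jobLoader.
            Class.forName("java.util.ArrayList");
            System.out.println("after Class.forName:    " + jobLoader.requested);

            // The explicit context-loader lookup does go through jobLoader.
            loadViaContextLoader("java.util.ArrayList");
            System.out.println("after context lookup:   " + jobLoader.requested);
        } finally {
            thread.setContextClassLoader(original);
        }
    }
}
```

In a real job the context loader would be the one that includes the job jar, so a class such as org.apache.nutch.protocol.ProtocolStatus would only be found by the second kind of lookup.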



> ReadDB url throws exception
> ---------------------------
>
>                 Key: NUTCH-1084
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1084
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: NUTCH-1084.patch
>
>
> Readdb -url suffers from two problems:
> 1. it trips over the _SUCCESS file generated by newer Hadoop versions
> 2. it throws can't find class: org.apache.nutch.protocol.ProtocolStatus (???)
> The first problem can be remedied by not allowing the injector or updater to write the
> _SUCCESS file. Until now that's the solution implemented for similar issues. I have not
> managed to make the Hadoop readers simply skip the file.
> The second issue seems a bit strange and did not happen on a local checkout. I'm not
> yet sure whether this is a Hadoop issue or something being corrupt in the CrawlDB. Here's
> the stack trace:
> {code}
> Exception in thread "main" java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus
>         at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
>         at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
>         at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
>         at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
>         at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524)
>         at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105)
>         at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383)
>         at org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389)
>         at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> {code}
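
On the first problem in the quoted description (readers tripping over the _SUCCESS marker), the idea of skipping the file could look roughly like the sketch below. It deliberately uses plain java.nio rather than Hadoop's FileSystem API, so it only illustrates the filtering step. SkipMarkerFilesSketch, isDataEntry and listDataEntries are hypothetical names, and treating every name that starts with "_" or "." as a non-data entry is an assumption of this example.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SkipMarkerFilesSketch {

    /** True for entries a reader should open; names starting with '_' or '.' are treated as markers. */
    static boolean isDataEntry(Path entry) {
        String name = entry.getFileName().toString();
        return !name.startsWith("_") && !name.startsWith(".");
    }

    /** Lists the entries of dir that pass isDataEntry, sorted by path. */
    static List<Path> listDataEntries(Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(dir, SkipMarkerFilesSketch::isDataEntry)) {
            for (Path p : stream) {
                result.add(p);
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory that mimics a part directory next to a _SUCCESS marker.
        Path dir = Files.createTempDirectory("crawldb-current");
        Files.createDirectory(dir.resolve("part-00000"));
        Files.createFile(dir.resolve("_SUCCESS"));
        System.out.println(listDataEntries(dir));
    }
}
```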



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
