nutch-dev mailing list archives

From "julien nioche (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19
Date Sat, 21 Feb 2009 01:08:01 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675518#action_12675518 ]

julien nioche commented on NUTCH-692:
-------------------------------------

I have been investigating this a bit more. The same problem occurs: some reduce tasks fail during
parsing, and once mapred.task.timeout is reached, the replacement tasks can't obtain a lease on the
output files, so we get the AlreadyBeingCreatedException.

This is clearly a Hadoop issue; I have not tried with a previous version and don't know whether
this will be fixed in the 0.19.1 release. Could this be due to the fact that the RecordWriter
in ParseOutputFormat holds multiple Writers internally?

I had a look at the other side of the problem and found that for some documents the tasks
were blocking on:

	at org.apache.oro.text.regex.Util.substitute(Unknown Source)
	at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.substituteUnnecessaryRelativePaths(BasicURLNormalizer.java:166)
	at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:125)
	at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:286)
	at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:223)
	at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:114)

and on the following regex used in the regex-urlfilter:

-.*(/[^/]+)/[^/]+\1/[^/]+\1/
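As a quick check of what this exclusion rule is meant to catch, here is a small standalone sketch (using java.util.regex rather than the Jakarta ORO engine the stack trace shows, and with hypothetical example URLs): the pattern flags URLs in which one path segment recurs twice more, each time separated by a single other segment, which is a typical crawler-trap shape.

```java
import java.util.regex.Pattern;

public class TrapRegexDemo {
    // Body of the regex-urlfilter rule (the leading '-' only marks it as
    // an exclusion rule, and the leading '.*' is implied by find()):
    // a path segment (\1) that appears three times, with exactly one
    // other segment between consecutive occurrences.
    static final Pattern TRAP = Pattern.compile("(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    static boolean isTrap(String url) {
        return TRAP.matcher(url).find();
    }

    public static void main(String[] args) {
        // "/a" recurs at every other segment -> rejected by the rule
        System.out.println(isTrap("http://example.com/a/b/a/c/a/"));  // true
        // all segments distinct -> accepted
        System.out.println(isTrap("http://example.com/a/b/c/d/e/"));  // false
    }
}
```

The backreferences are what make this rule expensive: on long paths that almost match, the engine can backtrack heavily, which fits the blocked-in-Util.substitute stack trace above.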

I haven't dumped the actual URLs in the logs, but I suspect that they come from the JSParser.
I will remove both the regex-urlfilter and the BasicURLNormalizer and see what I get.
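In the meantime, one possible mitigation for the hang itself would be to bound the normalization call with a hard time limit, so a pathological URL cannot stall the reduce task until mapred.task.timeout kills it. This is only a sketch, not Nutch code: normalizeWithTimeout and expensiveNormalize are hypothetical stand-ins (the latter for something like BasicURLNormalizer.normalize()), and a thread stuck inside a regex engine may not actually respond to the interrupt.

```java
import java.util.concurrent.*;

public class BoundedNormalize {
    private static final ExecutorService POOL = Executors.newSingleThreadExecutor();

    // Run a potentially backtracking-prone normalization with a time limit;
    // on timeout or failure, fall back to the raw URL.
    static String normalizeWithTimeout(String url, long millis) {
        Future<String> f = POOL.submit(() -> expensiveNormalize(url));
        try {
            return f.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);   // interrupt the worker (best effort only)
            return url;
        } catch (InterruptedException | ExecutionException e) {
            return url;
        }
    }

    // Hypothetical stand-in for the real normalizer, assumed slow on some inputs.
    static String expensiveNormalize(String url) {
        return url.toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(normalizeWithTimeout("HTTP://Example.COM/A/", 500));
        POOL.shutdown();
    }
}
```

Falling back to the unnormalized URL trades a little normalization quality for not losing the whole task, which seems preferable to the lease deadlock described above.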

J.





> AlreadyBeingCreatedException with Hadoop 0.19
> ---------------------------------------------
>
>                 Key: NUTCH-692
>                 URL: https://issues.apache.org/jira/browse/NUTCH-692
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: julien nioche
>
> I have been using the SVN version of Nutch on an EC2 cluster and got some AlreadyBeingCreatedExceptions
> during the reduce phase of a parse. For some reason one of my tasks crashed, and then I ran
> into this AlreadyBeingCreatedException when other nodes tried to pick it up.
> There was recently a discussion on the Hadoop user list on similar issues with Hadoop
> 0.19 (see http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried using
> 0.18.2 yet but will do so if the problems persist with 0.19.
> I was wondering whether anyone else had experienced the same problem. Do you think 0.19
> is stable enough to use for Nutch 1.0?
> I will be running a crawl on a super large cluster in the next couple of weeks and will
> confirm this issue.
> J.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

