nutch-dev mailing list archives

From "julien nioche (JIRA)" <>
Subject [jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19
Date Sat, 21 Feb 2009 01:08:01 GMT


julien nioche commented on NUTCH-692:

I have been investigating this a bit more. The problem is the same: some reduce tasks fail during
parsing, and once mapred.task.timeout is reached the reattempted tasks cannot obtain a lease on the
output files, so we get the AlreadyBeingCreatedException.
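For reference, the timeout involved here is mapred.task.timeout (in milliseconds; the default is 600000, i.e. 10 minutes). One possible stop-gap while the slow documents are tracked down is to raise it in mapred-site.xml; the value below is only an example, not a recommendation:

```xml
<!-- mapred-site.xml: raise the task timeout so long-running parses are not
     killed before they finish. Value is in milliseconds; the default is
     600000 (10 minutes). The 30-minute figure here is just an example. -->
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value>
</property>
```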

This is clearly a Hadoop issue; I have not tried a previous version and don't know whether
it will be fixed in the 0.19.1 release. Could this be due to the fact that the RecordWriter
in ParseOutputFormat holds multiple Writers internally?
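For what it's worth, the RecordWriter returned by ParseOutputFormat does fan each record out to several outputs at once. A minimal, HDFS-free sketch of that pattern (class and field names are mine, not Nutch's; StringWriters stand in for the real map/sequence files) illustrates why a task that dies mid-reduce leaves several files with open leases for the reattempt to trip over:

```java
import java.io.IOException;
import java.io.StringWriter;

// Sketch only (hypothetical names): a RecordWriter-style wrapper that, like
// ParseOutputFormat, writes each record to several underlying outputs. If the
// task dies between write() and close(), every one of these files is left with
// an open lease, and a reattempt on another node would hit
// AlreadyBeingCreatedException for each of them.
public class MultiWriterSketch {
    final StringWriter textOut = new StringWriter();   // stand-in for parse_text
    final StringWriter dataOut = new StringWriter();   // stand-in for parse_data
    final StringWriter crawlOut = new StringWriter();  // stand-in for crawl_parse

    public void write(String url, String parsedText) throws IOException {
        textOut.write(url + "\t" + parsedText + "\n");
        dataOut.write(url + "\tmetadata\n");
        crawlOut.write(url + "\tstatus\n");
    }

    // All three writers must be closed for their leases to be released;
    // a crash here releases none of them.
    public void close() throws IOException {
        textOut.close();
        dataOut.close();
        crawlOut.close();
    }

    public static void main(String[] args) throws IOException {
        MultiWriterSketch w = new MultiWriterSketch();
        w.write("http://example.com/", "hello");
        w.close();
        System.out.println("wrote one record to three outputs");
    }
}
```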

I had a look at the other side of the problem and found that for some documents the tasks
were blocking on:

	at org.apache.oro.text.regex.Util.substitute(Unknown Source)
	at org.apache.nutch.parse.ParseOutputFormat$1.write(
	at org.apache.nutch.parse.ParseOutputFormat$1.write(

and on the following regex used in regex-urlfilter:


I haven't dumped the actual URLs in the logs, but I suspect that they come from the JSParser.
I will remove both the regex-urlfilter and the BasicURLNormalizer and see what I get.
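An alternative to removing the filters entirely would be to run each regex with its own short timeout, so a catastrophically backtracking pattern cannot wedge the task until mapred.task.timeout kills it. A sketch of that idea using java.util.regex rather than ORO (the class names and the 1-second budget are mine, not Nutch code): the CharSequence wrapper makes the matcher responsive to interruption, so the match can actually be abandoned.

```java
import java.util.concurrent.*;
import java.util.regex.Pattern;

// Sketch only (not Nutch code): evaluate a regex against a URL with a hard
// timeout, so catastrophic backtracking cannot block a reduce task.
public class GuardedMatch {

    // CharSequence wrapper that lets Matcher notice thread interruption:
    // the regex engine reads input only through charAt(), so we check the
    // interrupt flag there and bail out with an exception.
    static class InterruptibleCharSequence implements CharSequence {
        private final CharSequence inner;
        InterruptibleCharSequence(CharSequence inner) { this.inner = inner; }
        public char charAt(int index) {
            if (Thread.currentThread().isInterrupted()) {
                throw new RuntimeException("regex interrupted");
            }
            return inner.charAt(index);
        }
        public int length() { return inner.length(); }
        public CharSequence subSequence(int start, int end) {
            return new InterruptibleCharSequence(inner.subSequence(start, end));
        }
        public String toString() { return inner.toString(); }
    }

    /** Returns TRUE/FALSE for a completed match, or null on timeout. */
    static Boolean matches(Pattern p, String url, long timeoutMs) throws Exception {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> f = ex.submit(
                () -> p.matcher(new InterruptibleCharSequence(url)).matches());
            try {
                return f.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                f.cancel(true);   // interrupts the matcher via charAt()
                return null;      // caller can then skip or reject the URL
            }
        } finally {
            ex.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Classic backtracking bomb: nested quantifier, input that cannot
        // match (40 a's and no 'b'), so the engine explores ~2^40 partitions.
        Pattern evil = Pattern.compile("(a+)+b");
        String bad = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!";
        System.out.println(matches(evil, bad, 1000));               // null (timed out)
        System.out.println(matches(Pattern.compile("https?://.*"),
                                   "http://example.com/", 1000));   // true
    }
}
```

A well-behaved pattern returns almost immediately, so the extra thread hop only costs anything on the pathological cases this is meant to catch.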


> AlreadyBeingCreatedException with Hadoop 0.19
> ---------------------------------------------
>                 Key: NUTCH-692
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: julien nioche
> I have been using the SVN version of Nutch on an EC2 cluster and got some AlreadyBeingCreatedException
during the reduce phase of a parse. For some reason one of my tasks crashed and then I ran
into this AlreadyBeingCreatedException when other nodes tried to pick it up.
> There was recently a discussion on the Hadoop user list about similar issues with Hadoop
0.19. I have not tried 0.18.2 yet but will do so if the problems persist with 0.19.
> I was wondering whether anyone else had experienced the same problem. Do you think 0.19
is stable enough to use for Nutch 1.0?
> I will be running a crawl on a super large cluster in the next couple of weeks and will
confirm whether this issue persists.
> J.  

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
