nutch-dev mailing list archives

From: Viral Shah <viral.s...@metaweb.com>
Subject: nutch fetch issue - empty content
Date: Tue, 09 Sep 2008 23:54:29 GMT
Hello --

We are using Nutch to crawl HTML content for Wikipedia articles. We're
using a somewhat old nightly build of Nutch.


We use a static list of URLs as input. To do this we've injected our
list of URLs, set db.update.additions.allowed to false, and set the
crawl depth to 1.
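
For reference, the db.update.additions.allowed override goes in conf/nutch-site.xml using the usual Hadoop property stanza; a minimal sketch (not our exact file):

	<property>
	  <name>db.update.additions.allowed</name>
	  <value>false</value>
	</property>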
	
	- We iterate over the output segment files using 'SequenceFile.Reader' and pull out both the 'string' and 'binary' forms of the content:
	
		from org.apache.hadoop.io import SequenceFile
		from org.apache.hadoop.fs import Path
		from java.lang import String
		import sys

		# 'filesystem' and 'job' are the Hadoop FileSystem and job config set up earlier
		reader = SequenceFile.Reader(filesystem, Path(sys.argv[1]), job)
		key = reader.getKeyClass()()        # key instance (the URL, a Text)
		content = reader.getValueClass()()  # value instance (Nutch Content)
		while reader.next(key, content):
			content_text = String(content.getContent(), "UTF-8").toString()
			content_binary = content.getContent()
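
		We run this under Jython, passing the path to a segment's content data file as the first argument (something like crawl/segments/<segment>/content/part-00000/data; treat that layout as an example, as it varies by Hadoop/Nutch version).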

	- I get empty content for some URLs, but their status in the crawldb is set to 'db_fetched'.
		The value of content_text is "" and that of content_binary is array('b', []).
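
		To put a number on it, a minimal variation of the loop above (reusing the same 'reader', 'key', and 'content' objects, and assuming the key is the URL) would just tally the empty records:

			empty, total = 0, 0
			while reader.next(key, content):
				total += 1
				if len(content.getContent()) == 0:
					empty += 1
					print key.toString()    # URL whose content came back empty
			print "%d of %d records empty" % (empty, total)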

	- This is completely random, both in when it happens and in which URLs are involved.

	- The failure is completely silent as far as I can tell; nothing about it appears in the logs.
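
For anyone who wants to poke at this, the affected segments can also be dumped from the command line with Nutch's SegmentReader, roughly (the segment path is just an example):

	bin/nutch readseg -dump crawl/segments/<segment> dump_out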


Again, we are crawling Wikipedia, whose content is verifiable and known to be accessible. We have tried manually fetching the problem URLs and everything looked fine.
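
The manual check was roughly along these lines, fetching directly with urllib rather than through Nutch's protocol layer (the URL below is just an example):

	import urllib
	data = urllib.urlopen("http://en.wikipedia.org/wiki/Example").read()
	print len(data)    # came back non-empty for every problem URL we tried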

Thank you,
Viral Shah