nutch-dev mailing list archives

From "Armel T. Nene" <armel.n...@idna-solutions.com>
Subject Fetching problem and FileProtocol bug in Nutch 0.8.1
Date Sun, 10 Dec 2006 21:16:00 GMT
Hi guys,

 

I have been writing patches and testing Nutch for the past month, and there
are some issues that I want to raise:

 

*	Nutch topN - we have to set the number of pages that we want to
fetch from a root url. With topN set, Nutch crawls and fetches that many
files. When the index is updated (re-crawl), however, Nutch takes the same
topN files in the directory again. The problem arises when you want to
fetch only a number of files at a time, so that the next crawl fetches only
new files from the directory or updates changes in the index. I think a
real adaptive fetching feature is not possible in Nutch unless Nutch
changes the way it indexes. By that I mean, when Nutch indexes after a
crawl or a re-crawl, it doesn't update the current index but creates a new
index each time. If Nutch had the ability to update its existing index, it
would be possible to set Nutch to crawl a number of files at a time and
update the index as it goes. This would reduce the time it takes for Nutch
to crawl large lists of urls or directories. It is important to implement
this feature, but I understand that it will probably not be possible until
version 1.2 of Nutch because of the changes that Nutch would have to
undergo.

 

*	I am not entirely sure if this is a bug, but here is the issue: I
have set up Nutch on MS Windows Server 2003. I have several logical drives,
such as C, D, E, etc. Nutch is installed and running on drive D, but when I
try to crawl a directory on another drive it fails with a FileProtocol
error 404. I know 404 is the error code for file not found. I can crawl any
directory on the drive where Nutch is installed. I tested this on a
different Windows server and drive but got the same error code. Can you let
me know if that's a known bug or just a configuration issue on my part?
Nutch works fine when it crawls any directory on its installed drive.
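
To make the first point more concrete, here is a rough sketch of the
batch-at-a-time behaviour I have in mind. It is not Nutch code and does not
use any Nutch API: the class name IncrementalBatchSketch, the method
nextBatch and the in-memory map of last-modified times are all made up for
illustration, and a real implementation would have to persist that state
and update the existing index for each batch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch (not part of Nutch): choose at most topN files per
 * round, skipping files that were already fetched and have not changed.
 */
public class IncrementalBatchSketch {

    // path -> last-modified time recorded when the file was last fetched
    private final Map<String, Long> fetched = new LinkedHashMap<>();

    /**
     * Returns up to topN paths from the listing that are new, or whose
     * last-modified time is newer than the recorded one, and records the
     * returned paths as fetched.
     *
     * @param listing path -> current last-modified time (assumed input)
     * @param topN    maximum number of files to return in this round
     */
    public List<String> nextBatch(Map<String, Long> listing, int topN) {
        List<String> batch = new ArrayList<>();
        for (Map.Entry<String, Long> entry : listing.entrySet()) {
            if (batch.size() >= topN) {
                break;
            }
            Long seen = fetched.get(entry.getKey());
            // New file, or modified since the last fetch.
            if (seen == null || seen < entry.getValue()) {
                batch.add(entry.getKey());
            }
        }
        // Remember what this round covered so the next round moves on.
        for (String path : batch) {
            fetched.put(path, listing.get(path));
        }
        return batch;
    }
}
```

With three unchanged files and topN set to 2, the first call would return
the first two files and the second call only the third, instead of taking
the same two files again.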

 

Please help shed some light on these issues.

 

Armel 

