nutch-dev mailing list archives

From Sami Siren
Subject Re: Fetching problem and FileProtocol bug in Nutch 0.8.1
Date Tue, 12 Dec 2006 16:08:54 GMT
Armel T. Nene wrote:
> * Nutch topN - we have to set the number of pages that we want to
> fetch from a root URL. When topN is set, Nutch will crawl and fetch
> that number of files. Therefore, when updating the index (re-crawl),
> Nutch will take the same topN files in the directory. The problem
> arises when you want to fetch only a number of files at a time, so that at
> the next crawl the crawler fetches only new files from the directory or
> updates changes in the index. I think the problem with real-adaptive

You could find your new or modified files more easily outside of Nutch,
with a command like:

  find <root> -mtime <nDays> | sed "s/^/file:\/\//" > fetchlist.txt

This would generate a file of file URLs for everything found under <root>
that was last changed <nDays> days ago.
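As a rough sketch of the same idea, the pipeline can be wrapped in a small shell function. The function name make_fetchlist and the paths in the usage note are only illustrative, and two details differ from the one-liner and are assumptions on my side: -type f skips directories, and -mtime -N (with a minus sign) selects files changed within the last N days rather than N days ago. It also assumes the root is given as an absolute path, so the output has the form file:///path/to/file.

```shell
#!/bin/sh
# make_fetchlist ROOT DAYS
# Print one file:// URL per regular file under ROOT that was modified
# within the last DAYS days. ROOT is assumed to be an absolute path.
# (Hypothetical helper name; not part of Nutch.)
make_fetchlist() {
    root=$1
    days=$2
    # -type f     : regular files only, no directories
    # -mtime -N   : modified less than N days ago
    # sed         : prefix every path with the file:// scheme
    find "$root" -type f -mtime "-$days" | sed 's|^|file://|'
}
```

For example, make_fetchlist /data/docs 1 > fetchlist.txt would list everything changed during the last day under the (hypothetical) directory /data/docs.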

You would then inject the list, and then generate, fetch, updatedb, index,
and search.

Bootstrapping the url list would also happen with find without extra

> fetching feature in Nutch is not possible unless Nutch changes the way it
> indexes. By that I mean, when Nutch indexes after a crawl or a re-crawl,
> Nutch doesn't update the current index but creates a new index after each
> indexing run. If Nutch had the ability to update its existing index, it would be

It would be interesting to experiment with real-time indexing hooks in the
fetcher, so that it could feed the content into Solr, for example, while it
is still fresh.

> * I am not entirely sure if this is a bug, but here is the issue: I have
> set up Nutch on MS Windows Server 2003. I have several logical drives, such as
> C, D, and E. Nutch is set up and running on drive D, but when I try to crawl
> a directory on another drive it fails with FileProtocol error 404. I know
> 404 is the error code for file not found. I can crawl any directory on
> the drive where Nutch is installed. I tested it on different Windows servers
> and drives but got the same error code. Can you let me know whether that's a known
> bug or just a configuration issue on my part? Nutch works fine when it
> crawls any directory on its installed drive.

I can't comment on that because I have no Windows machine available.

 Sami Siren
