nutch-dev mailing list archives

From "Armel T. Nene" <armel.n...@idna-solutions.com>
Subject Nutch site crawling
Date Thu, 07 Dec 2006 10:47:20 GMT
Hi,

Is it possible to let Nutch crawl a set of documents at a time?

I have set up Nutch with the following options:

topN 20
depth 2

I therefore wanted Nutch to crawl my directory no deeper than 2 links from
the root directory. The root directory itself contains more than 20 files,
but my understanding of topN is that it makes the crawler fetch 20 documents
and then index them. On the next crawl, it should pick another 20 files from
the directory and fetch and index those.
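
For reference, this is roughly how I start the crawl (a sketch of my
invocation rather than the exact command; the "urls" seed directory and the
"crawl" output directory are just placeholder names):

  bin/nutch crawl urls -dir crawl -depth 2 -topN 20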

 

My problem is that when Nutch crawls, it keeps fetching the same files over
and over again. That is a severe issue in my case because I have to run
Nutch on a directory with more than 100 GB of data. It is more efficient to
crawl and index a small set of files at a time than to fetch all the data
before indexing. Can you suggest a workaround for this, or let me know what
I am doing wrong?
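
To make it concrete, what I have in mind is something like the step-by-step
whole-web tools, where each pass only fetches the next small batch of
unfetched URLs. A rough sketch (assuming the standard 0.8 command-line
tools; all directory names below are placeholders):

  # seed the crawldb once with the root URL(s)
  bin/nutch inject crawl/crawldb urls

  # repeat per batch: generate the next 20 URLs, fetch them, update the db
  bin/nutch generate crawl/crawldb crawl/segments -topN 20
  s=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s

  # finally build the link database and index everything fetched so far
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*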

 

Thanks in advance.

Regards,

Armel

