manifoldcf-user mailing list archives

From Shigeki Kobayashi <shigeki.kobayas...@g.softbank.co.jp>
Subject Crawling new/updated files using Windows share connection takes too long
Date Fri, 18 Jan 2013 10:27:33 GMT
Hello


I would like some advice on reducing the crawl time for new/updated files
when using a Windows share connection.

I crawl files on a Windows server and index them into Solr.

Currently, the second crawl of two hundred thousand files takes over 5
hours, even though no files have been updated, created, or deleted.

I assume MCF performs the following steps (let me know if I am wrong):

- obtain the last-modified time of a file
- compare it with the time MCF obtained during the previous crawl
  (probably stored in the DB)
- if they differ, MCF treats the file as one to be indexed.
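
To make my assumption concrete, the steps above can be sketched roughly as
follows. This is only an illustration of what I imagine happens, not MCF's
actual code: find_changed_files and stored_mtimes are hypothetical names, and
a plain in-memory dict stands in for the DB.

```python
import os


def find_changed_files(root, stored_mtimes):
    """Return paths under root whose modification time differs from the stored one.

    stored_mtimes is a hypothetical stand-in for the DB: it maps a file path
    to the modification time (in nanoseconds) seen during the previous crawl.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # One metadata lookup per file; on a network share this is a
            # round trip to the file server.
            mtime = os.stat(path).st_mtime_ns
            # One lookup of the previously recorded value per file.
            if stored_mtimes.get(path) != mtime:
                changed.append(path)
                stored_mtimes[path] = mtime
    return changed
```

In this sketch every file costs one metadata lookup and one lookup of the
stored value, even when nothing has changed, so the total time grows with the
number of files rather than with the number of changes.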

If the above steps are performed for two hundred thousand files, which part
takes the most time? Obtaining the modified time? Reading data from the DB?
What do you think could be done to reduce the crawl time?

Please give me some advice.


Regards,

Shigeki
