manifoldcf-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com>
Subject web crawler job settings
Date Mon, 01 Jul 2013 13:56:17 GMT
Hi,

I am crawling the main pages of some online newspaper websites.
I don't need deletes at all, and I am using the crawl-once model.

Here are the settings I use:

Schedule type: Scan every document once
Start method: Start at beginning of schedule window

Scheduled time: Any day of week at 1 am, 3 am, 5 am, 7 am, 9 am, 11 am, 1 pm, 3 pm, 5 pm, 7 pm, 9 pm, 11 pm, plus 0 minutes
Maximum run time: No limit

Maximum hop count for link type 'link': 1
Maximum hop count for link type 'redirect': Unlimited
Hop count mode: No deletes, forever

Include only hosts matching seeds? yes
Seeds: A few URLs of the form http://main.page.com/{category}, where category is Sports, Politics, etc.

By setting the hop count to 1 (or 2) and the hop count mode to 'No deletes, forever', I expect
this crawl to be very fast and efficient, with minimal DB queries. Am I correct?
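
To make the expectation concrete, here is a minimal sketch of what I understand a maximum hop count of 1 to mean: the seeds plus the pages they link to directly, and nothing further. This is only an illustration of hop-limited traversal, not ManifoldCF's actual implementation; the function name, the in-memory link graph, and the article URLs are all hypothetical.

```python
from collections import deque


def urls_within_hops(seeds, links, max_hops):
    """Return {url: hop_count} for every URL reachable within max_hops link hops.

    `links` is a hypothetical in-memory map of page URL -> list of linked URLs.
    """
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops[url] >= max_hops:
            continue  # pages at the hop limit are kept, but their links are not followed
        for target in links.get(url, []):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


# Hypothetical link graph: one seed (a category page) and a few article pages.
links = {
    "http://main.page.com/Sports": [
        "http://main.page.com/Sports/a1",
        "http://main.page.com/Sports/a2",
    ],
    "http://main.page.com/Sports/a1": ["http://main.page.com/Sports/a3"],
}
seeds = ["http://main.page.com/Sports"]

result = urls_within_hops(seeds, links, max_hops=1)
for url, hop in sorted(result.items()):
    print(hop, url)
```

With a limit of 1, the article linked only from another article (a3 in this made-up graph) is never visited, which is why I expect the set of documents per run to stay small.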

Thanks,
Ahmet
