manifoldcf-user mailing list archives

From Erlend Garåsen <e.f.gara...@usit.uio.no>
Subject Re: Web crawler does not follow the robots meta tag rules
Date Wed, 02 Feb 2011 15:45:00 GMT
On 28.01.11 14.32, Karl Wright wrote:
> Thanks.  I tested my changes enough so that I was confident in
> committing the patch, so the changes are in trunk.

I'm afraid that it doesn't work properly. I downloaded the latest 
version from trunk and started the crawler.

Try to use the following address in your seed list and the following 
rule in the includes list:
^http://ridder.uio.no/.*

The following document was fetched and sent to Solr for indexing even 
though it includes a robots noindex rule:
http://ridder.uio.no/test_closed/

Here's the line from the history showing that the document was sent to Solr for indexing:
02-02-2011 16:12:33.283	document ingest (Solr)	http://ridder.uio.no/test_closed/	200
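
For illustration only, here is a minimal sketch of the kind of check that would be expected to stop such a document before ingestion. It is not the ManifoldCF code: the class name RobotsMetaSketch and the method hasNoindex are hypothetical, and a regular expression stands in for a real HTML parser. It also assumes that the "none" directive should be treated like "noindex".

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of ManifoldCF: a rough robots meta tag check.
class RobotsMetaSketch {
    // Matches a whole <meta ...> tag, case-insensitively.
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Matches key="value", key='value', or key=value inside a tag.
    private static final Pattern ATTR =
        Pattern.compile("([a-zA-Z-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    // Returns true if any <meta name="robots"> tag carries "noindex" or "none".
    static boolean hasNoindex(String html) {
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String name = null;
            String content = null;
            Matcher attr = ATTR.matcher(tag.group());
            while (attr.find()) {
                String key = attr.group(1).toLowerCase(Locale.ROOT);
                String value = attr.group(2) != null ? attr.group(2)
                             : attr.group(3) != null ? attr.group(3)
                             : attr.group(4);
                if (key.equals("name")) {
                    name = value;
                } else if (key.equals("content")) {
                    content = value;
                }
            }
            if (name == null || content == null
                    || !name.trim().equalsIgnoreCase("robots")) {
                continue; // not a robots meta tag
            }
            // Directives are comma-separated, e.g. "noindex, nofollow".
            for (String directive : content.toLowerCase(Locale.ROOT).split(",")) {
                String d = directive.trim();
                if (d.equals("noindex") || d.equals("none")) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up sample page, not the actual content of the test URL.
        String page = "<html><head>"
            + "<meta name=\"robots\" content=\"noindex, nofollow\">"
            + "</head><body>closed</body></html>";
        System.out.println("skip indexing: " + hasNoindex(page));
    }
}
```

A crawler using a check like this would skip the ingest step whenever hasNoindex returns true, while still being free to follow links unless "nofollow" is also present.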

Tomorrow I can try to modify the code you added to work around this. 
I guess I can find the relevant check somewhere in the following 
folder?
mcf-trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler

Erlend

-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
