manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Web crawler does not follow the robots meta tag rules
Date Thu, 27 Jan 2011 15:16:30 GMT
Sure, please open a ticket.
Interpreting the tag should not be difficult.  The main issues will be
around noting the crawler's decision to skip documents or content in
the activities history.  And, of course, this will not be available in
the ManifoldCF-0.1-incubating release.

Please specify what variants of the tag you think should be supported,
and if supported, how you think it should work.  For example,
including "nofollow" does not usually block crawlers from reaching
your linked documents from other directions; if you want that
functionality, you probably won't find that anywhere.  This is why
most people use robots.txt rather than the meta tag.
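
To make the "interpreting the tag" part concrete, here is a minimal sketch
of parsing the content attribute of a robots meta tag into two decisions:
whether to index the document and whether to queue its links. It is not
ManifoldCF code. The class name RobotsMetaDirectives and the method parse
are hypothetical, and the set of recognized variants (noindex, nofollow,
none) is an assumption about what might be supported. When directives
conflict, the sketch assumes the more restrictive one wins.

```java
import java.util.Locale;

/**
 * Hypothetical holder for the two decisions a crawler could derive from
 * <meta name="robots" content="...">. Not part of ManifoldCF.
 */
final class RobotsMetaDirectives {
    /** True if the document may be handed to the index. */
    final boolean index;
    /** True if links found in this document may be queued for crawling. */
    final boolean follow;

    RobotsMetaDirectives(boolean index, boolean follow) {
        this.index = index;
        this.follow = follow;
    }

    /**
     * Parses the value of the content attribute, e.g. "noindex, nofollow".
     * Tokens are comma-separated and matched case-insensitively.
     * Unknown tokens (and "index", "follow", "all") leave the permissive
     * defaults in place, so a restrictive token always wins.
     */
    static RobotsMetaDirectives parse(String content) {
        boolean index = true;
        boolean follow = true;
        if (content != null) {
            for (String token : content.split(",")) {
                switch (token.trim().toLowerCase(Locale.ROOT)) {
                    case "noindex":
                        index = false;
                        break;
                    case "nofollow":
                        // Only suppresses links extracted from this document;
                        // the targets can still be reached from elsewhere.
                        follow = false;
                        break;
                    case "none":
                        index = false;
                        follow = false;
                        break;
                    default:
                        break;
                }
            }
        }
        return new RobotsMetaDirectives(index, follow);
    }
}
```

With the tag from the original report, parse("noindex, nofollow") yields
index == false and follow == false. Recording the skip in the activities
history is left out of the sketch.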

Karl


On Thu, Jan 27, 2011 at 10:04 AM, Erlend Garåsen
<e.f.garasen@usit.uio.no> wrote:
>
> I just figured out that the web crawler does not follow the rules defined by
> the robots meta tag. I created a document with the following tag:
> <meta name="robots" content="noindex, nofollow">
>
> This document also has a link to another document in order to test the
> "nofollow" rule, but both documents were fetched and indexed by Solr.
>
> Should I open a Jira issue about this? I hope it's easy to rewrite the
> crawler in order to add this functionality since this is a blocker for us.
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
