nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1257) Support for the x-robots-tag HTTP Header
Date Tue, 05 Jan 2016 14:22:39 GMT


Markus Jelsma commented on NUTCH-1257:

Hmm, there is no patch, but I remember having this support in our older customized Nutch builds.
I'll see if I can find it again.

> Support for the x-robots-tag HTTP Header
> ----------------------------------------
>                 Key: NUTCH-1257
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Mike
>            Assignee: Markus Jelsma
>              Labels: http, privacy, robots
>             Fix For: 2.4
> Google and Bing both currently support the x-robots-tag HTTP header. This is important
> because they have a policy of not *crawling* URLs that are disallowed in robots.txt, and not
> *indexing* URLs that are marked noindex. If a page is indexed but not crawled,
> Google and Bing will show it in their results, but without a snippet (since they
> didn't crawl it, they can't generate one).
> As a result, the only way to keep a page out of the Google and Bing indexes
> is to use the robots meta tag in HTML pages and the x-robots-tag header for other MIME types.
> As a site owner who needs to keep specific pages private, I *cannot* trust robots.txt
> to keep my pages out of Google and Bing, and I have to use both robots standards. Since
> Nutch doesn't support the HTTP header, I have to block it from crawling ALL non-HTML pages
> on my site.
> This is not an ideal state of affairs, and it would be great if Nutch supported the x-robots-tag
> HTTP header.
> I've done more research on this topic on my blog:
>  -
>  -
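
To make the request above concrete, here is a rough sketch of how a crawler could interpret x-robots-tag header values before deciding whether to index a fetched document. It is not taken from any Nutch patch: the class XRobotsTag, its method names, and the agent name "nutch" are hypothetical, and the header syntax handled (comma-separated directives, an optional "agent:" prefix, and directives that carry a value such as unavailable_after) is an assumption based on the format the search engines document publicly.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Nutch) that interprets X-Robots-Tag header values. */
final class XRobotsTag {

    // Directives that carry their own "name: value" pair, so a colon after
    // them must not be mistaken for a crawler-name prefix (assumed list).
    private static final Set<String> VALUED_DIRECTIVES = Set.of(
        "unavailable_after", "max-snippet", "max-image-preview", "max-video-preview");

    private XRobotsTag() {}

    /** True if any header value tells this crawler not to index the document. */
    static boolean forbidsIndexing(List<String> headerValues, String agentName) {
        return hasDirective(headerValues, agentName, "noindex")
            || hasDirective(headerValues, agentName, "none");
    }

    /** True if any header value tells this crawler not to follow outlinks. */
    static boolean forbidsFollowing(List<String> headerValues, String agentName) {
        return hasDirective(headerValues, agentName, "nofollow")
            || hasDirective(headerValues, agentName, "none");
    }

    /** Checks every header value that applies to agentName for the given directive. */
    static boolean hasDirective(List<String> headerValues, String agentName, String directive) {
        String agent = agentName.toLowerCase(Locale.ROOT);
        for (String value : headerValues) {
            String rules = value.trim().toLowerCase(Locale.ROOT);
            int colon = rules.indexOf(':');
            if (colon >= 0) {
                String prefix = rules.substring(0, colon).trim();
                // A single token before the first colon that is not a valued
                // directive is treated as a crawler name.
                boolean looksLikeAgent =
                    !prefix.contains(",") && !VALUED_DIRECTIVES.contains(prefix);
                if (looksLikeAgent) {
                    if (!prefix.equals(agent)) {
                        continue; // addressed to a different crawler
                    }
                    rules = rules.substring(colon + 1);
                }
            }
            for (String token : rules.split(",")) {
                if (token.trim().equals(directive)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With something like this, a fetcher or indexing filter could pass in all x-robots-tag values from the HTTP response together with its own agent name, and skip indexing when forbidsIndexing returns true. Where exactly such a check would belong in Nutch is left open here.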

This message was sent by Atlassian JIRA