nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex
Date Wed, 15 Aug 2012 09:46:38 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434911#comment-13434911 ]

Markus Jelsma commented on NUTCH-1434:
--------------------------------------

I still think it only leads to confusion. We also removed the -parse switch in favour of the
configuration option because only one of them would ever work.
                
> Indexer to delete robots noIndex
> --------------------------------
>
>                 Key: NUTCH-1434
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1434
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 1.5.1
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: NUTCH-1434-1.6-1.patch, NUTCH-1434-1.6-2.patch
>
>
> Nutch does not treat pages with meta robots="noindex" properly. All it does is remove
> the title and content fields from the parsed data. It does not stop those pages from being
> indexed, nor can it delete existing pages from the index if they change.
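
The behaviour described in the issue can be sketched roughly as follows. This is an illustration only: it is not the attached patch and does not use Nutch's API. The names `ParsedPage`, `has_noindex`, `plan_index_actions`, and the `delete_noindex` flag are all hypothetical. With the flag off, the sketch mimics the behaviour the issue complains about (the page is still indexed, with title and content stripped); with the flag on, the page is kept out of the index and its URL is queued for deletion.

```python
from dataclasses import dataclass


@dataclass
class ParsedPage:
    """Hypothetical stand-in for a parsed page; not a Nutch class."""
    url: str
    title: str
    content: str
    robots_meta: str = ""  # raw robots meta value, e.g. "noindex, follow"


def has_noindex(robots_meta: str) -> bool:
    """Return True if the comma-separated robots directives include noindex."""
    directives = {d.strip().lower() for d in robots_meta.split(",")}
    return "noindex" in directives


def plan_index_actions(pages, delete_noindex=True):
    """Split pages into documents to index and URLs to delete.

    delete_noindex is an assumed configuration flag for this sketch.
    """
    to_index, to_delete = [], []
    for page in pages:
        if not has_noindex(page.robots_meta):
            to_index.append(page)
        elif delete_noindex:
            # Proposed behaviour: never index the page, and remove any
            # copy that an earlier crawl may have put in the index.
            to_delete.append(page.url)
        else:
            # Behaviour described in the issue: the page is still
            # indexed, only without its title and content.
            to_index.append(ParsedPage(page.url, "", "", page.robots_meta))
    return to_index, to_delete
```

Keeping the decision in a single configuration flag, rather than a flag plus a command-line switch, matches the concern in the comment above that two controls for the same behaviour only lead to confusion.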

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
