nutch-dev mailing list archives

From "Ferdy Galema (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex
Date Wed, 15 Aug 2012 12:03:38 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435000#comment-13435000 ]

Ferdy Galema commented on NUTCH-1434:
-------------------------------------

+1 for removing the command-line arguments and using configuration instead. (I would actually
like to see this done for many more tools, as it allows the greatest flexibility, but that is
another discussion.)
                
> Indexer to delete robots noIndex
> --------------------------------
>
>                 Key: NUTCH-1434
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1434
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 1.5.1
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: NUTCH-1434-1.6-1.patch, NUTCH-1434-1.6-2.patch, NUTCH-1434-1.6-3.patch
>
>
> Nutch does not treat pages with meta robots="noindex" properly. All it does is remove
> the title and content fields from the parsed data. It does not stop those pages from being
> indexed, nor can it delete existing pages from the index if they change.
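
To illustrate the behaviour discussed above, here is a minimal, self-contained Java sketch of a
configuration-driven decision: index a page, or emit a delete for it when its robots meta
contains "noindex". This is not the code from the attached patches. The class, the enum, the
method names and the property key are all assumptions made for the example, and
java.util.Properties stands in for whatever configuration object the indexer really uses.

```java
import java.util.Locale;
import java.util.Properties;

// Illustrative sketch only; none of these names are taken from the Nutch code base.
final class NoIndexSketch {

    // What the indexer should do with a page.
    enum Action { INDEX, DELETE }

    // Hypothetical property key, chosen for this example.
    static final String DELETE_NOINDEX_KEY = "indexer.delete.robots.noindex";

    // True if the comma-separated robots meta value contains the "noindex" directive.
    static boolean hasNoIndex(String robotsMeta) {
        if (robotsMeta == null) {
            return false;
        }
        for (String token : robotsMeta.toLowerCase(Locale.ROOT).split(",")) {
            if (token.trim().equals("noindex")) {
                return true;
            }
        }
        return false;
    }

    // The flag comes from configuration, not from a command-line argument.
    // With the flag off, a noindex page is still indexed (the old behaviour
    // described in the issue); with the flag on, a delete is emitted for it.
    static Action decide(String robotsMeta, Properties conf) {
        boolean deleteNoIndex =
            Boolean.parseBoolean(conf.getProperty(DELETE_NOINDEX_KEY, "false"));
        if (deleteNoIndex && hasNoIndex(robotsMeta)) {
            return Action.DELETE;
        }
        return Action.INDEX;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(DELETE_NOINDEX_KEY, "true");
        System.out.println(decide("noindex, nofollow", conf));
        System.out.println(decide("index, follow", conf));
    }
}
```

Keeping the switch in configuration means the same job definition can be reused with different
settings, which is the flexibility argued for in the comment above.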


        
