nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex
Date Tue, 14 Aug 2012 12:56:37 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434093#comment-13434093
] 

Markus Jelsma commented on NUTCH-1434:
--------------------------------------

Hi Lewis - I haven't added the configuration because it's overridden by the command line switch
regardless of the nutch-site.xml configuration. The property name can be seen in the IndexerMapReduce.java
patch:

+  public static final String INDEXER_DELETE_ROBOTS_NOINDEX = "indexer.delete.robots.noindex";

Indeed, the property name has no Solr prefix, because the feature is Solr-agnostic.
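
To illustrate why a nutch-site.xml entry would have no effect, here is a minimal sketch of the override behaviour described above. It is not the code from the patch: java.util.Properties stands in for the job configuration, the switch name -deleteRobotsNoIndex is a hypothetical placeholder, and treating an absent switch as "false" is an assumption. Only the property name comes from the patch.

```java
import java.util.Properties;

public class DeleteRobotsNoIndexSketch {
    // Property name as it appears in the IndexerMapReduce.java patch.
    public static final String INDEXER_DELETE_ROBOTS_NOINDEX = "indexer.delete.robots.noindex";

    // Hypothetical switch name, used for illustration only.
    static final String HYPOTHETICAL_SWITCH = "-deleteRobotsNoIndex";

    // Resolves the effective setting. Properties is a stand-in for the job configuration.
    public static boolean resolve(Properties siteConfig, String[] args) {
        boolean fromSwitch = false;
        for (String arg : args) {
            if (HYPOTHETICAL_SWITCH.equals(arg)) {
                fromSwitch = true;
            }
        }
        // Assumption: the job always writes the switch's value into the configuration,
        // so whatever the site configuration contained is replaced.
        siteConfig.setProperty(INDEXER_DELETE_ROBOTS_NOINDEX, Boolean.toString(fromSwitch));
        return Boolean.parseBoolean(siteConfig.getProperty(INDEXER_DELETE_ROBOTS_NOINDEX));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // As if the property had been set to true in nutch-site.xml.
        conf.setProperty(INDEXER_DELETE_ROBOTS_NOINDEX, "true");
        System.out.println(resolve(conf, args));
    }
}
```

Under these assumptions, the value set in the site configuration is discarded whenever the job runs, and only the presence or absence of the switch decides the outcome.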
                
> Indexer to delete robots noIndex
> --------------------------------
>
>                 Key: NUTCH-1434
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1434
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 1.5.1
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: NUTCH-1434-1.6-1.patch, NUTCH-1434-1.6-2.patch
>
>
> Nutch does not treat pages with meta robots="noindex" properly. All it does is remove
> the title and content fields from the parsed data. It does not stop those pages from being
> indexed, nor can it delete existing pages from the index if they change.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
