nutch-dev mailing list archives

From "Josh Pavel (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-966) Behavior of NOINDEX,FOLLOW is not intuitive
Date Wed, 09 Feb 2011 14:32:57 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992502#comment-12992502 ]

Josh Pavel commented on NUTCH-966:
----------------------------------

Here is a plugin that corrects the issue (again, thanks to Julien Nioche):

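import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class MetaNoIndexingFilter implements IndexingFilter {
    public static final Log LOG = LogFactory.getLog(MetaNoIndexingFilter.class);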

    private Configuration conf;

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        // Ideally this would rely on doc or parse metadata, but the HTML
        // parser stores nothing there, so check for empty text and title.
        String text = parse.getText();
        String title = parse.getData().getTitle();
        if ((text == null || text.equals(""))
                && (title == null || title.equals(""))) {
            // no text -> no indexing
            return null;
        }
        return doc;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    public Configuration getConf() {
        return this.conf;
    }

}
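
As a rough illustration of the check that filter() performs, here is a standalone sketch with no Nutch dependencies. The class and method names (EmptyContentCheck, isNullOrEmpty, shouldIndex) are hypothetical and not part of Nutch. It assumes, as the comment in the filter suggests, that a NOINDEX page reaches the indexer with both text and title empty.

```java
// Standalone sketch of the decision made in MetaNoIndexingFilter.filter().
// EmptyContentCheck and its methods are hypothetical names for illustration;
// they are not part of the Nutch API.
final class EmptyContentCheck {

    private EmptyContentCheck() {
        // Utility class, not meant to be instantiated.
    }

    // True when the string is null or has zero length (same test as equals("")).
    static boolean isNullOrEmpty(String s) {
        return s == null || s.isEmpty();
    }

    // A document is dropped only when BOTH text and title are missing;
    // the filter returns null in that case and the document otherwise.
    static boolean shouldIndex(String text, String title) {
        return !(isNullOrEmpty(text) && isNullOrEmpty(title));
    }
}
```

With this rule, shouldIndex(null, null) and shouldIndex("", "") are false, while a page that has only a title, or only body text, is still indexed.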

> Behavior of NOINDEX,FOLLOW is not intuitive
> -------------------------------------------
>
>                 Key: NUTCH-966
>                 URL: https://issues.apache.org/jira/browse/NUTCH-966
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>    Affects Versions: 1.2
>            Reporter: Josh Pavel
>            Priority: Minor
>
> If a page has NOINDEX,FOLLOW for the ROBOTS metatag, Nutch will still create a document
> that can be found in the index via metatag or URL matching.  Instead, Nutch should follow
> the page's outlinks without creating an index entry for the page itself. (thanks to Julien Nioche
> for helping me to understand the issue).

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
