nutch-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
Date Mon, 13 Apr 2015 15:24:13 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492506#comment-14492506 ]

Chris A. Mattmann commented on NUTCH-1927:
------------------------------------------

Thanks Lewis, and Seb, got it. Will fix the formatting. Seb:

bq. http.robot.rules.whitelist should be empty per default
Yep fixed this (and the surrounding code) in my latest patch. Will upload soon.

bq. the description says "hostnames or IP addresses" - is IP address white listing supported?
Yep, if a URL uses an IP, this would work fine. However, it may not work in other cases, since
we aren't resolving hostnames on the fly. We probably shouldn't, I guess.
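
A minimal sketch of what matching without on-the-fly resolution implies, assuming the whitelist is compared against the host exactly as it is written in the URL. The class and method names are hypothetical and are not taken from the patch.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LiteralHostMatch {

  // Hypothetical helper: true only if the host as written in the URL is in
  // the whitelist. No DNS lookup is performed, so a hostname that resolves
  // to a whitelisted IP does not match the IP entry.
  public static boolean matches(Set<String> whitelist, String url) {
    String host = URI.create(url).getHost();
    return host != null && whitelist.contains(host);
  }

  public static void main(String[] args) {
    Set<String> whitelist = new HashSet<>(Arrays.asList("132.54.99.22"));
    // URL written with the IP: the literal host equals the whitelist entry.
    System.out.println(matches(whitelist, "http://132.54.99.22/index.html"));
    // URL written with a hostname: no match, whatever the hostname resolves to.
    System.out.println(matches(whitelist, "http://hostname.apache.org/"));
  }
}
```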
 
bq. instead of repeatedly splitting whitelisted hosts at ',' use conf.getStrings(...) to initially
fill the white list
ACK, will do.

bq. also the white list is a set and should be stored as such to avoid iterating over the
list as in isWhiteListed()
Meaning then to replace with contains or something?
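
A rough sketch of the two suggestions above taken together, assuming the configured values arrive as a String[] (which is what Hadoop's conf.getStrings(...) returns, or null when the property is unset). The class name HostWhitelist and the lower-casing of entries are illustrative assumptions, not part of the patch.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class HostWhitelist {

  private final Set<String> hosts;

  // 'entries' stands in for the array that conf.getStrings(...) would
  // return; it is filled into the set once, instead of splitting the
  // property value at ',' on every lookup.
  public HostWhitelist(String[] entries) {
    Set<String> set = new HashSet<>();
    if (entries != null) {
      for (String entry : entries) {
        String host = entry.trim().toLowerCase(Locale.ROOT);
        if (!host.isEmpty()) {
          set.add(host);
        }
      }
    }
    this.hosts = Collections.unmodifiableSet(set);
  }

  // A set lookup replaces iterating over a list of hosts.
  public boolean isWhitelisted(String host) {
    return host != null && hosts.contains(host.trim().toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    HostWhitelist whitelist = new HostWhitelist(
        new String[] {"132.54.99.22", "hostname.apache.org", "foo.jpl.nasa.gov"});
    System.out.println(whitelist.isWhitelisted("hostname.apache.org"));
    System.out.println(whitelist.isWhitelisted("example.org"));
  }
}
```

With a null or empty array the set stays empty, which fits the request that the whitelist be empty per default.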

bq. Why is it necessary to create in Fetcher for every URL a new WhiteListRobotRules object?
Wouldn't it be simpler (and more efficient) to use the existing cache in RobotRulesParser
and just put a reference to a singleton white list rules object if the host is element of
the white list?

Good idea, will do so. New patch coming soon! Are you at ApacheCon?
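
One way the cache idea could look, as a self-contained sketch rather than the actual RobotRulesParser code. The Rules interface, the ALLOW_ALL singleton, and fetchAndParse are hypothetical stand-ins; the stub parser simply disallows paths containing "/private/" so that the difference is visible.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RobotRulesCacheSketch {

  // Hypothetical stand-in for a parsed robots.txt rule set.
  public interface Rules {
    boolean isAllowed(String url);
  }

  // One shared "allow everything" object, reused for every whitelisted host.
  public static final Rules ALLOW_ALL = url -> true;

  private final Set<String> whitelist;
  private final Map<String, Rules> cache = new HashMap<>();

  public RobotRulesCacheSketch(Set<String> whitelist) {
    this.whitelist = whitelist;
  }

  // Whitelisted hosts get a reference to the singleton stored in the cache;
  // all other hosts go through the normal fetch-and-parse path.
  public Rules getRules(String host) {
    Rules cached = cache.get(host);
    if (cached != null) {
      return cached;
    }
    Rules rules = whitelist.contains(host) ? ALLOW_ALL : fetchAndParse(host);
    cache.put(host, rules);
    return rules;
  }

  // Stub in place of real robots.txt fetching and parsing.
  private Rules fetchAndParse(String host) {
    return url -> !url.contains("/private/");
  }

  public static void main(String[] args) {
    Set<String> whitelist = new HashSet<>(Arrays.asList("foo.jpl.nasa.gov"));
    RobotRulesCacheSketch cache = new RobotRulesCacheSketch(whitelist);
    System.out.println(
        cache.getRules("foo.jpl.nasa.gov").isAllowed("http://foo.jpl.nasa.gov/private/x"));
    System.out.println(
        cache.getRules("example.org").isAllowed("http://example.org/private/x"));
  }
}
```

With this shape the fetcher would not need to create a rules object per URL; it would only ask the cache for the host's rules.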

> Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-1927
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1927
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>              Labels: available, patch
>             Fix For: 1.10
>
>         Attachments: NUTCH-1927.Mattmann.041115.patch.txt, NUTCH-1927.Mattmann.041215.patch.txt
>
>
> Based on discussion on the dev list, to use Nutch for some valid security research use cases (DDoS, DNS, and other testing), I am going to create a patch that allows a whitelist:
> {code:xml}
> <property>
>   <name>robot.rules.whitelist</name>
>   <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
>   <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for.
>   </description>
> </property>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
