nutch-dev mailing list archives

From "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: [jira] [Updated] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
Date Fri, 17 Apr 2015 20:27:06 GMT
+1 please commit! Thanks seb 

Sent from my iPhone

> On Apr 17, 2015, at 4:15 PM, Sebastian Nagel (JIRA) <jira@apache.org> wrote:
> 
> 
>     [ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Sebastian Nagel updated NUTCH-1927:
> -----------------------------------
>    Attachment: test_NUTCH-1927.2015-04-17.txt
>                NUTCH-1927.2015-04-17.patch
> 
> Patch to log more verbosely, here for a test on "localhost":
> {noformat}
> 2015-04-17 21:58:03,902 INFO  protocol.RobotRulesParser - Whitelisted hosts: [localhost]
> ...
> 2015-04-17 21:58:03,906 INFO  api.HttpRobotRulesParser - Whitelisted host found for: http://localhost/foo/index.html
> 2015-04-17 21:58:03,906 INFO  api.HttpRobotRulesParser - Ignoring robots.txt for all URLs from whitelisted host: localhost
> {noformat}
> 
> RobotRulesParser now implements Tool to make testing easier: properties can be passed via "-Dprop=val"; see the attached log from the test session.
> 
>> Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
>> ---------------------------------------------------------------------------
>> 
>>                Key: NUTCH-1927
>>                URL: https://issues.apache.org/jira/browse/NUTCH-1927
>>            Project: Nutch
>>         Issue Type: New Feature
>>         Components: fetcher
>>           Reporter: Chris A. Mattmann
>>           Assignee: Chris A. Mattmann
>>             Labels: available, patch
>>            Fix For: 1.10
>> 
>>        Attachments: NUTCH-1927.2015-04-16.patch, NUTCH-1927.2015-04-17.patch, NUTCH-1927.Mattmann.041115.patch.txt, NUTCH-1927.Mattmann.041215.patch.txt, NUTCH-1927.Mattmann.041415.patch.txt, test_NUTCH-1927.2015-04-17.txt
>> 
>> 
>> Based on discussion on the dev list, to support some valid security research use cases for Nutch (DDoS, DNS, and other testing), I am going to create a patch that allows a whitelist:
>> {code:xml}
>> <property>
>>  <name>robot.rules.whitelist</name>
>>  <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
>>  <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for.
>>  </description>
>> </property>
>> {code}
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
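
For readers of the archive, here is a minimal sketch of how a comma-separated value such as the one in robot.rules.whitelist quoted above could be parsed and matched against a URL's host. This is not the code from the attached patch: the class name RobotsWhitelistSketch and both method names are hypothetical, and the exact-match, case-insensitive comparison is an assumption made for illustration.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration; not taken from the NUTCH-1927 patch. */
public class RobotsWhitelistSketch {

    /** Splits a comma-separated property value into a set of lower-cased hosts/IPs. */
    public static Set<String> parseWhitelist(String value) {
        if (value == null) {
            return Collections.emptySet();
        }
        return Arrays.stream(value.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())              // tolerate stray commas
                .map(s -> s.toLowerCase(Locale.ROOT))   // host names are case-insensitive
                .collect(Collectors.toSet());
    }

    /** Returns true if the URL's host (name or IP literal) is an exact whitelist entry. */
    public static boolean isWhitelisted(Set<String> whitelist, String url) {
        try {
            String host = URI.create(url).getHost();
            return host != null && whitelist.contains(host.toLowerCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // Unparsable URL: treat as not whitelisted so robots.txt is still honored.
            return false;
        }
    }

    public static void main(String[] args) {
        Set<String> whitelist =
                parseWhitelist("132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov");
        System.out.println(isWhitelisted(whitelist, "http://hostname.apache.org/foo/index.html"));
        System.out.println(isWhitelisted(whitelist, "http://example.org/"));
    }
}
```

In this sketch a host that is not listed, or a URL that cannot be parsed, falls through to normal robots.txt handling, so the whitelist can only widen access for hosts that were named explicitly.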
