nutch-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
Date Thu, 16 Apr 2015 02:32:59 GMT


Chris A. Mattmann commented on NUTCH-1927:

Hi Seb!


bq. Hi Chris, the class WhiteListRobotRules seems to me still overly complex. It should be
possible to keep the cache as is and only put a reference to light-weight singleton RobotRules
object (such as created by the default constructor of the WhiteListRobotRules) in case a host
is whitelisted.

I don't understand this. Can you please reply with code? For example, WhiteListRobotRules
*does* in fact simply store a singleton reference to a RobotRules object, under the premises
for which it's constructed (no longer in the Fetcher but really only in the Protocol Layers
by way of the RobotRulesParser base class). I did add a no-argument constructor that creates
a blank WhiteListRobotRules, in which case it constructs a new default RobotRules instance -
is that what you are objecting to? Do you want me to remove the no-argument constructor?

bq. Also I do not understand why getCrawlDelay() needs to store the last URL: the Crawl-Delay
specified in the robots.txt can be used to override the default delay/interval when a robot/crawler
accesses the same host successively: it's a fixed value and does not depend on any previous

Right - and all I'm doing is ensuring that when getCrawlDelay() is first called after the
WhiteListRobotRules decorator is retrieved from the CACHE (at which point the URL isn't
passed again), the decorator remembers the URL it was constructed with (when it was created
in the cache in my patch).
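A purely illustrative sketch of that cache flow (the names and the whitelist check here are made up, not the actual patch): the decorator records the URL it was created for at cache-insert time, so a later getCrawlDelay() call, which takes no URL, can still tell which host it applies to.

```java
import java.util.HashMap;
import java.util.Map;

class CachedRules {
    private final String constructedForUrl;  // remembered at construction

    CachedRules(String url) { this.constructedForUrl = url; }

    long getCrawlDelay() {
        // No URL parameter here; use the one remembered from construction.
        // Placeholder whitelist check for illustration only.
        return constructedForUrl.contains("whitelisted.example") ? 0L : 5000L;
    }

    // URL -> rules cache, populated on first lookup.
    static final Map<String, CachedRules> CACHE = new HashMap<>();

    static CachedRules get(String url) {
        return CACHE.computeIfAbsent(url, CachedRules::new);
    }
}
```

Since the Crawl-Delay is a fixed per-host value, remembering the construction URL is just a way to keep the URL-free getCrawlDelay() signature working once the decorator is handed out of the cache.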

bq. Don't know whether this is a problem: we (almost) everywhere use org.slf4j.Logger and
not java.util.logging.Logger.

Happy to change this.

So, new patch to change to the slf4j Logger; other than that, are we OK?

> Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
> ---------------------------------------------------------------------------
>                 Key: NUTCH-1927
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>              Labels: available, patch
>             Fix For: 1.10
>         Attachments: NUTCH-1927.Mattmann.041115.patch.txt, NUTCH-1927.Mattmann.041215.patch.txt,
> Based on discussion on the dev list, to use Nutch for some security research valid use
cases (DDoS; DNS and other testing), I am going to create a patch that allows a whitelist:
> {code:xml}
> <property>
>   <name>robot.rules.whitelist</name>
>   <value>,,</value>
>   <description>Comma separated list of hostnames or IP addresses to ignore robot
rules parsing for.
>   </description>
> </property>
> {code}
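A minimal, self-contained sketch (plain Java for illustration; the real patch would read this value through Hadoop's Configuration) of how the comma-separated robot.rules.whitelist property could be parsed and checked:

```java
import java.util.HashSet;
import java.util.Set;

class RobotRulesWhitelist {
    private final Set<String> hosts = new HashSet<>();

    RobotRulesWhitelist(String propertyValue) {
        // Split the comma-separated property value, trimming whitespace
        // and ignoring empty entries.
        for (String h : propertyValue.split(",")) {
            String trimmed = h.trim();
            if (!trimmed.isEmpty()) {
                hosts.add(trimmed.toLowerCase());
            }
        }
    }

    // True if robots.txt parsing should be skipped for this host/IP.
    boolean isWhitelisted(String host) {
        return hosts.contains(host.toLowerCase());
    }
}
```

The lowercasing keeps hostname matching case-insensitive; IP addresses pass through unchanged.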

This message was sent by Atlassian JIRA
