nutch-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1995) Add support for wildcard to http.robot.rules.whitelist
Date Wed, 20 May 2015 22:44:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553264#comment-14553264 ]

Chris A. Mattmann commented on NUTCH-1995:
------------------------------------------

Hey Seb, yeah, I don't think we should support a bare *, for sure. At the same time, turning off
robots.txt used to be as easy as literally commenting out two lines and typing ant runtime. We
shouldn't fool ourselves that we are preventing anything, even with whitelisting. That can
still be done (and I know of many, many situations, including valid security use cases, in
which it is). Like I also said before, all we are doing in those cases is encouraging people
to fork and build their own crawlers, and call it != Nutch. I don't think we want that. I
personally don't want that. I'm also trying to encourage more and more people in that domain
to use Nutch, whereas so far they've gone off and built their own, modified Nutch with a
two-line patch, rebuilt it and called it something else, and/or used Scrapy. None of those are
ideal solutions IMO.

So, back to the point. Let's check for a bare "*", and of course not support that. But the other
patterns (*.blah.*, */* blah, whatever), let's support those. Is that a fair compromise?


> Add support for wildcard to http.robot.rules.whitelist
> ------------------------------------------------------
>
>                 Key: NUTCH-1995
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1995
>             Project: Nutch
>          Issue Type: Improvement
>          Components: robots
>    Affects Versions: 1.10
>            Reporter: Giuseppe Totaro
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>         Attachments: NUTCH-1995.patch
>
>
> The {{http.robot.rules.whitelist}} ([NUTCH-1927|https://issues.apache.org/jira/browse/NUTCH-1927])
> configuration parameter allows specifying a comma-separated list of hostnames or IP addresses
> for which robot rules parsing is ignored.
> Adding wildcard support to {{http.robot.rules.whitelist}} could be very useful and would
> simplify the configuration when, for example, many hostnames/addresses need to be listed. Here
> is an example:
> {noformat}
> <property>
>   <name>http.robot.rules.whitelist</name>
>   <value>*.sample.com</value>
>   <description>Comma separated list of hostnames or IP addresses to ignore 
>   robot rules parsing for. Use with care and only if you are explicitly
>   allowed by the site owner to ignore the site's robots.txt!
>   </description>
> </property>
> {noformat}
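
For illustration only, here is a minimal sketch of how the compromise discussed in this thread could look: entries such as *.sample.com are accepted, while an entry made up solely of "*" characters is rejected. This is not the attached NUTCH-1995.patch; the class name WhitelistSketch and its methods are hypothetical, and the decision to reject only wildcard-only entries is an assumption about what "check for *" would mean in practice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: parses a comma-separated
// whitelist value and matches hostnames against it.
class WhitelistSketch {
  private final List<Pattern> patterns = new ArrayList<>();

  WhitelistSketch(String configValue) {
    for (String entry : configValue.split(",")) {
      String glob = entry.trim().toLowerCase(Locale.ROOT);
      if (glob.isEmpty()) {
        continue; // tolerate stray commas and whitespace
      }
      // Assumption: an entry consisting only of '*' would whitelist
      // every host, so it is refused outright.
      if (glob.replace("*", "").isEmpty()) {
        throw new IllegalArgumentException(
            "Catch-all whitelist entry is not supported: " + entry.trim());
      }
      patterns.add(toRegex(glob));
    }
  }

  // Translate a glob into a regex: each '*' becomes ".*" and every
  // other run of characters is quoted so that '.' stays literal.
  static Pattern toRegex(String glob) {
    StringBuilder regex = new StringBuilder();
    String[] parts = glob.split("\\*", -1);
    for (int i = 0; i < parts.length; i++) {
      if (i > 0) {
        regex.append(".*");
      }
      if (!parts[i].isEmpty()) {
        regex.append(Pattern.quote(parts[i]));
      }
    }
    return Pattern.compile(regex.toString());
  }

  boolean isWhitelisted(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    for (Pattern p : patterns) {
      if (p.matcher(h).matches()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    WhitelistSketch w = new WhitelistSketch("*.sample.com, 10.0.0.1");
    System.out.println(w.isWhitelisted("www.sample.com"));
    System.out.println(w.isWhitelisted("example.org"));
  }
}
```

With this translation, *.sample.com matches www.sample.com but not sample.com itself, because the literal dot must be present; a real implementation would need to decide whether the bare domain should also be covered, and whether patterns such as *.* count as catch-alls.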



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
