nutch-dev mailing list archives

From "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov>
Subject Option to disable Robots Rule checking
Date Tue, 27 Jan 2015 22:42:16 GMT
Hey Guys,

I’ve recently been made aware of some situations in which
we are using crawlers like Nutch and explicitly do not want
to honor robots.txt (some for research purposes; some for
other purposes). Right now, of course, this isn’t possible, since
honoring it is always required.

What would you guys think of an optional configuration setting (turned
off by default) that allows bypassing robots.txt rules?

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



