nutch-dev mailing list archives

From Matt Kangas
Subject Re: Urlfilter Patch
Date Thu, 01 Dec 2005 22:23:08 GMT
Totally agreed. Neither approach replaces the other. I just wanted to
mention the possibility so people don't over-focus on trying to build a
hyper-optimized regex list. :)

For the content provider, an HTTP HEAD request saves bandwidth
whenever we don't follow it with a GET. That's a cost savings for them
over a blind fetch (esp. if we would discard the content anyway).
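
To make the HEAD-before-GET idea concrete, here is a minimal sketch in Python. It is not how Nutch works; the names `WANTED_TYPES`, `media_type`, `is_wanted` and `head_then_get` are made up for illustration, and the allow-list of types is only an assumption. It also ignores politeness delays entirely.

```python
import urllib.request

# Hypothetical allow-list; a real installation would choose its own types.
WANTED_TYPES = {"text/html", "application/xhtml+xml", "text/plain"}


def media_type(content_type_header):
    """Return the bare, lowercased media type from a Content-Type value.

    Parameters such as "; charset=UTF-8" are dropped. Returns None when the
    header is missing or empty.
    """
    if not content_type_header:
        return None
    return content_type_header.split(";", 1)[0].strip().lower() or None


def is_wanted(content_type_header, wanted=WANTED_TYPES):
    """True if the header names a media type we would keep."""
    return media_type(content_type_header) in wanted


def head_then_get(url, wanted=WANTED_TYPES, timeout=10):
    """Issue a HEAD first; only GET the body if the type looks useful.

    Returns the body bytes, or None if the HEAD response says to skip it.
    """
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head, timeout=timeout) as resp:
        ctype = resp.headers.get("Content-Type")
    if not is_wanted(ctype, wanted):
        return None  # one cheap request spent, no body transferred
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()  # second hit on the server for wanted content
```

Note that this sketch skips a URL when the server sends no Content-Type at all. That is a design choice for the example, not a recommendation; falling back to a GET in that case would be equally defensible.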

I guess the question is, which is worse:
- two server hits whenever we find content we want, or
- spending bandwidth on pages that the Nutch installation will ignore?
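
One rough way to compare the two options is a back-of-envelope model. The sketch below is only that: `crawl_cost` and all of its parameters are hypothetical, it assumes every URL has the same header and body size, and it charges one politeness delay per request, HEAD included.

```python
def crawl_cost(n_urls, wanted_fraction, body_bytes, header_bytes, delay_s):
    """Compare a blind GET crawl with a HEAD-then-GET crawl.

    Returns two dicts (blind, head_first), each with the request count,
    bytes transferred, and total politeness delay in seconds.
    """
    wanted = n_urls * wanted_fraction

    # Blind fetch: one GET per URL, every body is downloaded.
    blind = {
        "requests": n_urls,
        "bytes": n_urls * (header_bytes + body_bytes),
        "delay_s": n_urls * delay_s,
    }

    # HEAD first: one HEAD per URL, plus a GET only for wanted URLs.
    head_first_requests = n_urls + wanted
    head_first = {
        "requests": head_first_requests,
        "bytes": n_urls * header_bytes + wanted * (header_bytes + body_bytes),
        "delay_s": head_first_requests * delay_s,
    }
    return blind, head_first


blind, head_first = crawl_cost(
    n_urls=100, wanted_fraction=0.5, body_bytes=1000,
    header_bytes=100, delay_s=1.0,
)
```

With these made-up numbers, HEAD-first moves fewer bytes but makes more requests and so spends more time waiting. The model does not capture server-side CPU cost, which matters if a HEAD on a dynamic page is as expensive to generate as a GET.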


On Dec 1, 2005, at 4:40 PM, Doug Cutting wrote:

> Matt Kangas wrote:
>> The latter is not strictly true. Nutch could issue an HTTP HEAD
>> before the HTTP GET, and determine the mime-type before actually
>> grabbing the content.
>> It's not how Nutch works now, but this might be more useful than a
>> super-detailed set of regexes...
> This could be a useful addition, but it could not replace url-based  
> filters.  A HEAD request must still be polite, so this could  
> substantially slow fetching, as it would incur more delays.  Also,  
> for most dynamic pages, a HEAD is as expensive for the server as a  
> GET, so this would cause more load on servers.
> Doug

Matt Kangas
