nutch-dev mailing list archives

From Doug Cutting <>
Subject Re: Urlfilter Patch
Date Thu, 01 Dec 2005 21:40:06 GMT
Matt Kangas wrote:
> The latter is not strictly true. Nutch could issue an HTTP HEAD before
> the HTTP GET, and determine the mime-type before actually grabbing the
> content.
> It's not how Nutch works now, but this might be more useful than a
> super-detailed set of regexes...

This could be a useful addition, but it could not replace url-based 
filters.  A HEAD request must still be polite, so this could 
substantially slow fetching, as it would incur more delays.  Also, for 
most dynamic pages, a HEAD is as expensive for the server as a GET, so 
this would cause more load on servers.
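
To make the cost concrete, here is a rough back-of-the-envelope sketch of the extra politeness delay a HEAD-before-GET scheme would incur on a single host. It is not Nutch code: the function names, the 5-second delay, and the URL counts are all assumptions chosen for illustration, and it assumes every request to the same host (HEAD or GET) is separated by the same delay.

```python
def requests_per_host(num_urls, head_first, num_rejected_by_head=0):
    """Count the HTTP requests sent to one host (hypothetical model).

    Without a HEAD pre-check, each URL costs one GET. With it, each URL
    costs one HEAD, plus one GET for every URL the HEAD did not reject.
    """
    if not head_first:
        return num_urls
    if not 0 <= num_rejected_by_head <= num_urls:
        raise ValueError("num_rejected_by_head must be between 0 and num_urls")
    return num_urls + (num_urls - num_rejected_by_head)


def politeness_wait_seconds(num_requests, delay_seconds):
    """Total time spent waiting between consecutive requests to one host."""
    return max(num_requests - 1, 0) * delay_seconds


# Assumed figures: 100 URLs on one host, a 5-second delay between requests,
# and 20 URLs filtered out by their HEAD response.
get_only = politeness_wait_seconds(requests_per_host(100, head_first=False), 5.0)
head_then_get = politeness_wait_seconds(
    requests_per_host(100, head_first=True, num_rejected_by_head=20), 5.0
)
print(get_only, head_then_get)  # 495.0 895.0
```

Under these assumptions the HEAD pre-check only breaks even if it rejects every URL on the host. A URL-based filter rejects a URL without sending any request at all.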

