nutch-dev mailing list archives

From Jérôme Charron <>
Subject Re: [Nutch-dev] HttpProtocol Plugin questions
Date Sun, 27 Mar 2005 20:58:53 GMT
> Jérôme: One can provide If-Modified-Since information in GET requests,
> too.  I think that's preferable to HEAD, because with HEAD requests one
> would have to first perform a HEAD request, and then another GET for
> changed pages.  With the conditional GET request a single request is
> all that's needed, as long as the If-Modified-Since request header is
> provided.
Yes, that's true. But a HEAD request can still be useful if you want to
filter on HTTP headers. For instance, if you don't want to download
resources of certain content-types, you can perform a HEAD request and
cancel the operation if the content-type of the HEAD response is not
one you want to index.
Moreover, if the code keeps the same connection open for the two
requests (HEAD and GET), it will not really hurt performance.
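
To make the filtering idea concrete, here is a rough sketch in plain
java.net, not the actual Http plugin code. The class name HeadFilter,
the INDEXABLE allow-list and the choice to skip responses that carry no
Content-Type are all assumptions made for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;
import java.util.Set;

class HeadFilter {
    // Hypothetical allow-list; a real crawler would read this from its config.
    static final Set<String> INDEXABLE =
            Set.of("text/html", "text/plain", "application/xhtml+xml");

    // Compares only the media type, ignoring parameters such as "; charset=...".
    static boolean isIndexable(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return false; // assumption: no Content-Type means "skip"
        }
        int semi = contentTypeHeader.indexOf(';');
        String mediaType = (semi >= 0
                ? contentTypeHeader.substring(0, semi)
                : contentTypeHeader).trim().toLowerCase(Locale.ROOT);
        return INDEXABLE.contains(mediaType);
    }

    // Sends a HEAD request and reports whether a follow-up GET is worthwhile.
    // disconnect() is deliberately not called, so the JVM may keep the
    // connection alive and reuse it for the GET to the same host.
    static boolean worthFetching(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("HEAD");
        int status = conn.getResponseCode();
        return status == HttpURLConnection.HTTP_OK
                && isIndexable(conn.getContentType());
    }
}
```

The fetcher would call worthFetching(url) first and issue the GET only
when it returns true.
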
A more complex use of the HEAD method would be to use it for
resources that are not modified frequently, and to use a GET
(If-Modified-Since) request for resources that are modified frequently
(this implies that Nutch must keep a history of modifications!).
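
A minimal sketch of how such a choice could look, assuming the history
is reduced to two counters per URL (number of fetches and number of
fetches where a change was seen). RefetchPolicy, its 0.5 threshold and
the counters are hypothetical; nothing like this exists in Nutch as far
as this message is concerned.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class RefetchPolicy {
    enum Method { HEAD, CONDITIONAL_GET }

    // Hypothetical threshold: a page that changed on at least half of the
    // past fetches counts as "frequently modified".
    static final double FREQUENT_CHANGE_RATIO = 0.5;

    // Picks the request method from the recorded modification history.
    static Method choose(int fetches, int changesSeen) {
        if (fetches <= 0) {
            return Method.CONDITIONAL_GET; // no history yet: one request suffices
        }
        double ratio = (double) changesSeen / fetches;
        return ratio >= FREQUENT_CHANGE_RATIO
                ? Method.CONDITIONAL_GET
                : Method.HEAD;
    }

    // Conditional GET: the server answers 304 Not Modified without a body
    // when the resource is unchanged since lastFetchMillis. This sketch only
    // inspects the status; a real fetcher would read the body on a 200.
    static boolean unchangedSince(URL url, long lastFetchMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setIfModifiedSince(lastFetchMillis); // sets If-Modified-Since
        try {
            return conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED;
        } finally {
            conn.disconnect();
        }
    }
}
```

With these assumed numbers, a page that changed on 1 of 10 fetches would
be checked with HEAD, and one that changed on 5 of 10 with a conditional
GET.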

Once I finish implementing Mime-Magic support, I will run some
tests of the HEAD method in the Http plugin.

