nutch-dev mailing list archives

From <>
Subject Re: [Nutch-dev] HttpProtocol Plugin questions
Date Sat, 26 Mar 2005 23:42:26 GMT
Doug: replying to email sent to sets To

Jérôme: One can provide If-Modified-Since information in GET requests,
too.  I think that's preferable to HEAD, because with HEAD requests one
would have to first perform a HEAD request, and then another GET for
changed pages.  With the conditional GET request a single request is
all that's needed, as long as the If-Modified-Since request header is
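
To illustrate the conditional GET described above, here is a minimal sketch using the JDK's HttpURLConnection. The ConditionalGet class, its Outcome enum, and the classify and fetch methods are hypothetical names invented for this example and are not part of Nutch's HttpProtocol plugin. Treating both 404 and 410 as "document no longer exists" is likewise an assumption.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical helper (not Nutch code): one conditional GET per URL. */
final class ConditionalGet {

    /** What the crawler could do with its stored copy of the page. */
    enum Outcome { UNCHANGED, CHANGED, GONE, OTHER }

    /** Maps an HTTP status code to an outcome. */
    static Outcome classify(int status) {
        switch (status) {
            case HttpURLConnection.HTTP_NOT_MODIFIED: // 304: keep the stored copy
                return Outcome.UNCHANGED;
            case HttpURLConnection.HTTP_OK:           // 200: new content is in the body
                return Outcome.CHANGED;
            case HttpURLConnection.HTTP_NOT_FOUND:    // 404 (assumed to mean "gone")
            case HttpURLConnection.HTTP_GONE:         // 410
                return Outcome.GONE;
            default:                                  // redirects, errors, etc.
                return Outcome.OTHER;
        }
    }

    /**
     * Sends a single GET carrying an If-Modified-Since header.
     * lastFetchMillis is the time of the previous fetch, in epoch milliseconds.
     */
    static Outcome fetch(URL url, long lastFetchMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            conn.setRequestMethod("GET");
            conn.setIfModifiedSince(lastFetchMillis); // sets the If-Modified-Since header
            return classify(conn.getResponseCode());
        } finally {
            conn.disconnect();
        }
    }
}
```

With this shape a single round trip is enough: a 304 response carries no body, and a 200 response already contains the changed page, so no preliminary HEAD request is needed. The same status code also lets the caller tell "not changed" apart from "no longer exists".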


--- Jérôme Charron <> wrote:
> Hello,
> Looking at the HttpProtocol plugin code, I saw some possible
> improvements, but I am not sure they are feasible:
> 1. The HttpProtocol plugin always performs GET requests. In my
> mind, a crawler designed to crawl the web (i.e. one that will frequently
> update its index and documents) needs to use the HTTP HEAD method in
> order to know if the requested URL has been modified since the last
> crawl. Such an implementation drastically reduces the bandwidth
> needed to update a set of documents: it only downloads the changed
> documents (I now better understand why some messages on the list express
> the monthly bandwidth needed for a set of pages as a constant value).
> I don't yet have enough Nutch knowledge to see what the
> implications are for index/segment management, because I imagine that
> such a mechanism implies that we can:
> * preserve previously fetched documents that were not re-fetched due to
> a "Not-Changed" HEAD response.
> * delete a document that no longer exists (the Protocol plugin must be able
> to return to the Nutch core a return code that distinguishes between
> "document no longer exists" and "document not changed since last time").
> 2. I think HTTP pipelining could be a good way to improve performance
> too. What do you think about it?
> Thanks
> Jerome
> -- 
> - motrech [home]
> - motrech [blog]
> - motrech [liste]
> - frutch [liste]
