nutch-dev mailing list archives

From <ogjunk-nu...@yahoo.com>
Subject Re: [Nutch-dev] HttpProtocol Plugin questions
Date Sat, 26 Mar 2005 23:42:26 GMT
Doug: replying to email sent to nutch-dev@incubator.apache.org sets To
to dev@nutch.org

Jérôme: One can provide If-Modified-Since information in GET requests,
too.  I think that's preferable to HEAD, because with HEAD requests one
would have to first perform a HEAD request, and then another GET for
changed pages.  With the conditional GET request a single request is
all that's needed, as long as the If-Modified-Since request header is
provided.
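
A minimal sketch of the conditional GET described above, using the
standard java.net.HttpURLConnection. The class ConditionalGet, the
Outcome enum, and the method names are hypothetical and not taken from
the Nutch code; how a protocol plugin would actually report these
outcomes to the core is an assumption.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalGet {

    /** Hypothetical outcome type for illustration; not part of Nutch. */
    enum Outcome { FETCHED, NOT_MODIFIED, GONE, OTHER }

    /** Maps an HTTP status code to one of the outcomes discussed in the thread. */
    static Outcome classify(int status) {
        switch (status) {
            case HttpURLConnection.HTTP_OK:            // 200: full body was sent
                return Outcome.FETCHED;
            case HttpURLConnection.HTTP_NOT_MODIFIED:  // 304: keep the stored copy
                return Outcome.NOT_MODIFIED;
            case HttpURLConnection.HTTP_NOT_FOUND:     // 404
            case HttpURLConnection.HTTP_GONE:          // 410: document no longer exists
                return Outcome.GONE;
            default:                                   // redirects, server errors, ...
                return Outcome.OTHER;
        }
    }

    /**
     * Issues a single GET carrying an If-Modified-Since header.
     * lastFetchMillis is the time of the previous crawl, in milliseconds
     * since the epoch (an assumed input; the caller must store it).
     */
    static Outcome fetchIfModified(URL url, long lastFetchMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setIfModifiedSince(lastFetchMillis); // sets the If-Modified-Since header
        try {
            // A real fetcher would read conn.getInputStream() on a 200 here.
            return classify(conn.getResponseCode());
        } finally {
            conn.disconnect();
        }
    }
}
```

A server that ignores If-Modified-Since simply answers 200 with the
full body, so the worst case is the same cost as the unconditional GET
performed today.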

Otis



--- Jérôme Charron <jerome.charron@gmail.com> wrote:
> Hello,
> 
> Looking at the HttpProtocol plugin code, I saw some possible
> improvements, but I am not sure they are feasible:
> 
> 1. The HttpProtocol plugin always performs GET requests. In my
> mind, a crawler designed to crawl the web (i.e. one that will frequently
> update its index and documents) needs to use the HTTP HEAD method in
> order to know whether the requested URL has been modified since the last
> crawl. Such an implementation drastically reduces the bandwidth
> needed to update a set of documents: it only downloads the changed
> documents (I now better understand why some messages on the list express
> the monthly bandwidth needed for a set of pages as a constant value).
> I don't yet have enough Nutch knowledge to see what the
> implications are for index/segment management, because I imagine that
> such a mechanism implies that we can:
> * preserve previously fetched documents that were not re-fetched due to
> a "Not-Changed" HEAD response.
> * delete a document that no longer exists (the Protocol plugin must be
> able to return to the Nutch core a return code that distinguishes between
> "document no longer exists" and "document not changed since last time").
> 
> 2. I think HTTP pipelining could be a good way to improve performance
> too. What do you think about it?
> 
> Thanks
> 
> 
> Jerome
> 
> 
> -- 
> http://motrech.free.fr/ - motrech [home]
> http://motrech.blogspot.com/ - motrech [blog]
> http://fr.groups.yahoo.com/group/motrech - motrech [liste]
> http://fr.groups.yahoo.com/group/frutch - frutch [liste]
> 
> 
