nutch-dev mailing list archives

From Jérôme Charron <>
Subject HttpProtocol Plugin questions
Date Sat, 26 Mar 2005 14:31:37 GMT

Looking at the HttpProtocol plugin code, I saw some possible
improvements, but I am not sure they are feasible:

1. The HttpProtocol plugin always issues GET requests. In my view, a
crawler designed to crawl the web (i.e. one that frequently updates
its index and documents) needs to use the HTTP HEAD method to find
out whether the requested URL has been modified since the last
crawl. Such an implementation drastically reduces the bandwidth
needed to update a set of documents: it only downloads the changed
documents (I now better understand why some messages on the list
quote the monthly bandwidth needed for a set of pages as a constant
value). I don't yet have enough Nutch knowledge to see what the
implications are for index/segment management, because I imagine
that such a mechanism implies that we can:
* preserve previously fetched documents that were not re-fetched
because of a "Not-Changed" HEAD response.
* delete a document that no longer exists (the protocol plugin must
be able to return to the Nutch core a code that distinguishes between
"document no longer exists" and "document not changed since last time").
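
The return codes described in the list above could be sketched roughly
as follows. This is only an illustration, not Nutch code: the class
RefetchCheck, the enum Outcome and both method names are hypothetical.
It assumes the server honours an If-Modified-Since header by answering
304 Not Modified, and it treats 404 and 410 as "no longer exists".

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class RefetchCheck {

    /** Hypothetical outcomes a protocol plugin could report to the core. */
    public enum Outcome { NOT_CHANGED, GONE, CHANGED, RETRY_LATER }

    /** Maps an HTTP status code to an outcome. Pure function, no network. */
    public static Outcome classify(int status) {
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
            return Outcome.NOT_CHANGED;          // 304: keep the stored copy
        }
        if (status == HttpURLConnection.HTTP_NOT_FOUND
                || status == HttpURLConnection.HTTP_GONE) {
            return Outcome.GONE;                 // 404/410: delete the document
        }
        if (status >= 200 && status < 300) {
            return Outcome.CHANGED;              // 2xx: worth a full GET
        }
        return Outcome.RETRY_LATER;              // anything else: undecided
    }

    /** Sends a conditional HEAD request and classifies the response. */
    public static Outcome check(URL url, long lastFetchMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            conn.setRequestMethod("HEAD");
            // Adds an If-Modified-Since header built from the last fetch time.
            conn.setIfModifiedSince(lastFetchMillis);
            return classify(conn.getResponseCode());
        } finally {
            conn.disconnect();
        }
    }
}
```

Only a CHANGED outcome would then trigger a full GET. Whether real
servers report modification times reliably enough for this is a
separate question.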

2. I think HTTP pipelining could be a good way to improve
performance too. What do you think about it?



