nutch-dev mailing list archives

From Matt Kangas <>
Subject Re: Urlfilter Patch
Date Fri, 02 Dec 2005 20:17:37 GMT

After sleeping on this idea, I realized that there's a middle ground  
that may give us (and website operators) the best of both worlds.

The question: how to avoid fetching unparseable content?

Value in answering this:
- save crawl operators bandwidth, disk space, cpu time
- save website operators bandwidth (and maybe cpu time) = be better  
web citizens

Tools available:
- regex-urlfilter.txt (nearly free to run, but only an approximate  
filter)
- HTTP HEAD before GET (cheaper than a blind GET, but mainly saves  
bandwidth, not server cpu)
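The HEAD-before-GET check could look roughly like the sketch below. This is an illustration only, not Nutch code; the class name, the `worthFetching` helper, and the parseable-mime whitelist are all hypothetical, and a real crawler would route the request through its politeness and robots.txt machinery:

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Set;

/** Sketch: issue an HTTP HEAD and decide from the mime-type whether a
 *  full GET is worthwhile. Hypothetical helper, not actual Nutch code. */
public class HeadCheck {

    // Assumed whitelist of mime-types the parser can handle.
    private static final Set<String> PARSEABLE =
        Set.of("text/html", "text/plain", "application/xhtml+xml");

    static boolean isParseable(String contentType) {
        if (contentType == null) return false;
        // Strip parameters like "; charset=utf-8" before comparing.
        int semi = contentType.indexOf(';');
        String mime = (semi >= 0 ? contentType.substring(0, semi)
                                 : contentType).trim();
        return PARSEABLE.contains(mime.toLowerCase());
    }

    static boolean worthFetching(String url) throws Exception {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");  // headers only, no body
        boolean ok = isParseable(conn.getContentType());
        conn.disconnect();
        return ok;
    }
}
```

As Doug notes below, the HEAD itself still counts against the politeness budget, so this check is not free even when it saves a GET.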

Proposed strategy:

1) Define regex-urlfilter.txt, as we do now. Continue to weed out  
known-unparseable file extensions as early as possible.
2) Also define a second regex list for extensions that are very likely  
to be text/html (e.g. .html, .php). Fetch these blindly with HTTP GET.
3) For everything else, perform HTTP HEAD first. If the mime-type is  
unparseable, do not follow with HTTP GET.
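The three steps above amount to a small decision function. A minimal sketch, assuming hypothetical extension patterns (the class, the enum, and the regexes are illustrative, not part of Nutch):

```java
import java.util.regex.Pattern;

/** Sketch of the proposed three-tier fetch policy. Names and patterns
 *  are hypothetical, not actual Nutch APIs or config. */
public class FetchPolicy {

    enum Action { SKIP, GET, HEAD_THEN_GET }

    // Step 1: known-unparseable extensions, weeded out early
    // (the job regex-urlfilter.txt does today).
    private static final Pattern UNPARSEABLE =
        Pattern.compile("(?i)\\.(exe|zip|iso|dmg)$");

    // Step 2: extensions very likely to be text/html.
    private static final Pattern LIKELY_HTML =
        Pattern.compile("(?i)\\.(html?|php|asp|jsp)$|/$");

    static Action classify(String url) {
        if (UNPARSEABLE.matcher(url).find()) return Action.SKIP;
        if (LIKELY_HTML.matcher(url).find()) return Action.GET;
        // Step 3: everything else gets a HEAD first; GET only follows
        // if the reported mime-type is parseable.
        return Action.HEAD_THEN_GET;
    }

    public static void main(String[] args) {
        System.out.println(classify("http://example.com/index.html")); // GET
        System.out.println(classify("http://example.com/file.zip"));   // SKIP
        System.out.println(classify("http://example.com/download"));   // HEAD_THEN_GET
    }
}
```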

Advantages to this approach:
- still weeds out known-bad stuff as early as possible
- saves crawl+server bandwidth in questionable cases
- saves server load in high-confidence cases (eliminates HTTP HEAD)

Disadvantages: ?

On Dec 1, 2005, at 5:23 PM, Matt Kangas wrote:

> Totally agreed. Neither approach replaces the other. I just wanted  
> to mention the possibility so people don't over-focus on trying to  
> build a hyper-optimized regex list. :)
> For the content provider, an HTTP HEAD request saves them bandwidth  
> if we don't do a GET. That's some cost savings for them over doing  
> a blind fetch (esp. if we discard it).
> I guess the question is, what's worse:
> - two server hits when we find content we want?, or
> - spending bandwidth on pages that the Nutch installation will  
> ignore anyway?
> --matt
> On Dec 1, 2005, at 4:40 PM, Doug Cutting wrote:
>> Matt Kangas wrote:
>>> The latter is not strictly true. Nutch could issue an HTTP HEAD   
>>> before the HTTP GET, and determine the mime-type before actually   
>>> grabbing the content.
>>> It's not how Nutch works now, but this might be more useful than  
>>> a  super-detailed set of regexes...
>> This could be a useful addition, but it could not replace url- 
>> based filters.  A HEAD request must still be polite, so this could  
>> substantially slow fetching, as it would incur more delays.  Also,  
>> for most dynamic pages, a HEAD is as expensive for the server as a  
>> GET, so this would cause more load on servers.
>> Doug
> --
> Matt Kangas /

Matt Kangas /
