nutch-dev mailing list archives

From "Chris Mattmann" <>
Subject RE: Urlfilter Patch
Date Thu, 01 Dec 2005 22:04:51 GMT
Hi Doug,

> Chris Mattmann wrote:
> >   In principle, the mimeType system should give us some guidance on
> > determining the appropriate mimeType for the content, regardless of
> > whether it ends in .foo, .bar or the like.
> Right, but the URL filters run long before we know the mime type, in
> order to try to keep us from fetching lots of stuff we can't process.
> The mime type is not known until we've fetched it.

Duh, you're right. Sorry about that. 

Matt Kangas wrote:
> The latter is not strictly true. Nutch could issue an HTTP HEAD  
> before the HTTP GET, and determine the mime-type before actually  
> grabbing the content.
> It's not how Nutch works now, but this might be more useful than a 
> super-detailed set of regexes...

I liked Matt's idea of the HEAD request though. I wonder if some benchmarks
of its performance would be useful, because in some cases (such as focused
crawling, or "non-whole-internet" crawling of an intranet, etc.), it would
seem that issuing the HEAD to get the content-type would be worth the
performance penalty of the extra request...
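
To make the HEAD-before-GET idea concrete, here is a minimal sketch in Java.
It is not how Nutch works and none of these names are Nutch APIs: the class
HeadProbe, the PARSEABLE allow-list and the method names are all hypothetical,
and the only library used is java.net.HttpURLConnection from the JDK. It
assumes a server that answers HEAD with a usable Content-Type header, which
not every server does.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch.
class HeadProbe {

    // Assumed allow-list of types we can parse; a real crawler would
    // derive this from whatever parsers it has installed.
    static final Set<String> PARSEABLE =
            Set.of("text/html", "text/plain", "application/pdf");

    // Strips parameters such as "; charset=UTF-8" and lower-cases the type.
    // Returns null when the header is missing or empty.
    static String baseType(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return null;
        }
        int semi = contentTypeHeader.indexOf(';');
        String base = semi >= 0
                ? contentTypeHeader.substring(0, semi)
                : contentTypeHeader;
        base = base.trim().toLowerCase(Locale.ROOT);
        return base.isEmpty() ? null : base;
    }

    // Decides whether the follow-up GET is worthwhile. An unknown type is
    // treated as fetchable, so a server that omits the header costs us
    // nothing more than the extra HEAD.
    static boolean worthFetching(String contentTypeHeader) {
        String base = baseType(contentTypeHeader);
        return base == null || PARSEABLE.contains(base);
    }

    // Issues the HEAD request and returns the raw Content-Type header,
    // or null if the server did not send one.
    static String headContentType(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("HEAD");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            conn.getResponseCode(); // forces the request to be sent
            return conn.getContentType();
        } finally {
            conn.disconnect();
        }
    }
}
```

A fetcher would call worthFetching(headContentType(url)) and only issue the
GET when it returns true. Timing that path against a plain GET on a focused
or intranet crawl would be the kind of benchmark mentioned above.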

