nutch-dev mailing list archives

From Matt Kangas <kan...@gmail.com>
Subject Re: Urlfilter Patch
Date Thu, 01 Dec 2005 21:30:48 GMT
The latter is not strictly true. Nutch could issue an HTTP HEAD
before the HTTP GET, and determine the mime-type (as reported in the
server's Content-Type header) before actually grabbing the content.

It's not how Nutch works now, but this might be more useful than a  
super-detailed set of regexes...

kangas@kangas-dev:~$ telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.localdomain.
Escape character is '^]'.
HEAD / HTTP/1.0

HTTP/1.1 200 OK
Date: Thu, 01 Dec 2005 21:25:38 GMT
Server: Apache/2.0
Connection: close
Content-Type: text/html; charset=UTF-8

Connection closed by foreign host
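
The same check can be sketched in code. This is only a rough
illustration in Python using the standard library, not how Nutch does
(or would do) it: the function names and the ALLOWED_MIME_TYPES set
are made up for the example, and it assumes the server answers HEAD
with a meaningful Content-Type header.

```python
import http.client
from urllib.parse import urlsplit

# Hypothetical allow-list for illustration; not taken from Nutch's configuration.
ALLOWED_MIME_TYPES = {"text/html", "text/plain"}


def parse_mime_type(content_type):
    """Return the bare mime type from a Content-Type header value, or None."""
    if not content_type:
        return None
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    return content_type.split(";", 1)[0].strip().lower() or None


def head_mime_type(url, timeout=10.0):
    """Issue an HTTP HEAD request and return the advertised mime type, if any."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return parse_mime_type(response.getheader("Content-Type"))
    finally:
        conn.close()


def should_fetch(mime_type):
    """Decide whether to follow up with a GET.

    An unknown type is allowed through here (an assumption of this sketch),
    so a server that omits Content-Type on HEAD is not silently skipped.
    """
    return mime_type is None or mime_type in ALLOWED_MIME_TYPES
```

For the server above, head_mime_type("http://localhost/") would yield
"text/html", so should_fetch() says go ahead with the GET. The cost is
an extra round trip per URL, and the answer is only as good as the
header the server sends.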



On Dec 1, 2005, at 4:21 PM, Doug Cutting wrote:

> Chris Mattmann wrote:
>>   In principle, the mimeType system should give us some guidance on
>> determining the appropriate mimeType for the content, regardless of
>> whether it ends in .foo, .bar or the like.
>
> Right, but the URL filters run long before we know the mime type,  
> in order to try to keep us from fetching lots of stuff we can't  
> process. The mime type is not known until we've fetched it.
>
> Doug

--
Matt Kangas / kangas@gmail.com


