nutch-dev mailing list archives

From Doug Cutting
Subject Re: Content-Type inconsistency?
Date Thu, 27 Apr 2006 21:48:36 GMT
Jérôme Charron wrote:
> Finally, it is good news that Nutch seems to be more "intelligent" at
> content-type guessing than Firefox or IE, no?

I'm not so sure.  When crawling Apache we had trouble with this feature.
Some HTML files had an XML header, and although the server identified
them as "text/html", Nutch decided to treat them as XML, not HTML.  We
had to turn off the guessing of content types to index Apache correctly.
I think we shouldn't aim to guess things any more than a browser does.
If browsers require standards compliance, then our lives will be simpler.
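
To make the policy above concrete, here is a minimal Python sketch. It is
not Nutch's actual code: the function names, the guess_overrides_header
flag, and the crude sniffing rules are all assumptions made for
illustration. It shows how a page served as "text/html" but starting with
an XML declaration gets different types depending on whether guessing is
allowed to override the server's header.

```python
from typing import Optional


def sniff_content_type(body: bytes) -> str:
    """Crude guess from the leading bytes (hypothetical helper, not a Nutch API)."""
    head = body.lstrip()[:256].lower()
    if head.startswith(b"<?xml"):
        return "application/xml"
    if head.startswith(b"<!doctype html") or b"<html" in head:
        return "text/html"
    return "application/octet-stream"


def resolve_content_type(
    header: Optional[str],
    body: bytes,
    guess_overrides_header: bool = False,  # hypothetical switch for this sketch
) -> str:
    """Pick a content type for a fetched document.

    With guess_overrides_header=False the server's declared type wins
    whenever it is present, and sniffing is only a fallback.
    """
    # Drop parameters such as "; charset=utf-8" and normalise case.
    declared = header.split(";", 1)[0].strip().lower() if header else ""
    if declared and not guess_overrides_header:
        return declared
    guessed = sniff_content_type(body)
    # If sniffing learned nothing, fall back to whatever the server said.
    if guessed == "application/octet-stream" and declared:
        return declared
    return guessed


# An HTML page that starts with an XML declaration, served as text/html.
page = b'<?xml version="1.0"?>\n<html><body>hi</body></html>'
trusting = resolve_content_type("text/html; charset=utf-8", page)
guessing = resolve_content_type("text/html", page, guess_overrides_header=True)
```

In this sketch, trusting the header keeps the page as HTML, while letting
the guess override it reclassifies the same page as XML, which is the kind
of mismatch described above.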

