nutch-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Content-Type inconsistency?
Date Tue, 02 May 2006 16:34:53 GMT
Jérôme Charron wrote:
>> We had to turn off
>> the guessing of content types to index Apache correctly.
> 
> Instead of turning off the guessing of content types, you need only
> remove the magic for XML in mime-types.xml

Perhaps that would have worked also, but, with Apache, simply trusting 
the declared Content-Type seems to work quite well.

>> I think we
>> shouldn't aim to guess things any more than a browser does.  If browsers
>> require standards compliance, then our lives will be simpler.
> 
> Yes, but actually Nutch cannot act as a browser.
> For instance with RSS: a browser knows that a URL is an RSS feed because
> there is a <link rel="alternate" type="..."/>
> with the correct content-type (application/rss+xml) in the referring HTML
> page.
> Nutch doesn't keep such information for guessing a content-type (it could
> be a good thing to add), so it must find the content-type from the URL
> (without any context).
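
For illustration, the browser-side discovery described in the quoted text
can be sketched roughly as follows. This is a minimal Python sketch, not
Nutch code: the names FeedLinkCollector, find_feed_links and FEED_TYPES are
hypothetical, and the set of feed media types is an assumption.

```python
from html.parser import HTMLParser

# Assumption: the media types treated as "feed" types in this sketch.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkCollector(HTMLParser):
    """Collects (href, type) pairs from <link rel="alternate"> elements."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute names arrive lowercased; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").strip().lower()
        href = attr.get("href", "")
        if "alternate" in rels and mime in FEED_TYPES and href:
            self.feeds.append((href, mime))


def find_feed_links(html):
    """Return the feed links declared by a referring HTML page."""
    collector = FeedLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.feeds
```

A crawler that wanted this context would have to store the collected type
next to each outlink, which is the information the quoted text says Nutch
does not keep.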

Shouldn't RSS feeds declare the correct content-type?

http://feedvalidator.org/docs/warning/NonSpecificMediaType.html

I don't see that context should be required for feeds.
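
As a rough illustration of trusting the declared type, the check could look
like the Python sketch below. It is not Nutch code: the function names and
the two type sets are assumptions made for the example, and the notion of a
"generic" type only loosely follows the validator warning linked above.

```python
# Assumption: media types counted as feed-specific in this sketch.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}
# Assumption: XML types considered too generic to identify a feed.
GENERIC_XML_TYPES = {"text/xml", "application/xml"}


def media_type(content_type_header):
    """Return the bare media type from a Content-Type value, or None."""
    if not content_type_header:
        return None
    # Drop parameters such as "; charset=utf-8" and normalise case.
    bare = content_type_header.split(";", 1)[0].strip().lower()
    return bare or None


def classify_declared_type(content_type_header):
    """Classify a declared Content-Type without looking at the body."""
    bare = media_type(content_type_header)
    if bare is None:
        return "missing"
    if bare in FEED_TYPES:
        return "feed"
    if bare in GENERIC_XML_TYPES:
        return "generic-xml"
    return "other"
```

Only the "missing" and "generic-xml" cases would leave anything to guess; a
feed that declares a specific type is identified from the header alone.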

Doug
