nutch-dev mailing list archives

From Jérôme Charron <jerome.char...@gmail.com>
Subject Re: RSS Parser Bug!?
Date Thu, 08 Sep 2005 14:22:15 GMT
> I'm not necessarily sure that this is a "bug" per se: it's just the fact 
> that several different content types are potentially possible when any ol' 
> webserver returns an RSS file. To be honest, I performed a pretty detailed 
> crawl (100s of thousands of pages) when I originally wrote the plugin way 
> back in March/April of this year, and the two content types that you see in 
> the code right now that it checks for are what I found to be the most 
> pervasive content type returned from webservers for RSS. However, in no way 
> did I mean for that list to be exhaustive: for instance, web servers may 
> also return "application/rss", or "text/rss", or even "text/plain", which I 
> have seen for RSS. It all depends on how the webmaster has configured the web 
> server. So it's kind of difficult to accurately and reliably discriminate 
> against the content type within a parser plugin itself, because it is 
> inherently out of the parser's hands what gets returned for a particular type 
> of file, and even though there are some "best practices" for what should be 
> returned for different file types, there are by no means any "standards" 
> that must be followed.
> 
> So, I would propose the following. I believe the block of code that checks 
> the content type and then throws an exception exists in other plugins in 
> Nutch as well. I propose we nix that, remove the content type checking 
> and exception message from the plugins themselves, and move it up to a 
> higher level, i.e., the actual plugin factory or something. Let it get 
> taken care of there, and let it be configurable, outside the code of each 
> plugin. Because that way, I believe you can customize whatever 
> plugin to do whatever you need, * without * having to recompile the code 
> just to add another accepted content type to a plugin so it doesn't throw an 
> error message.
> 
> What say you guys? :-)
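
To make the quoted proposal concrete, here is a minimal sketch of what a configurable content-type lookup above the plugins could look like. It is only an illustration under stated assumptions: the class name ContentTypeRegistry, the method parserFor, the plugin id "parse-rss", and the use of java.util.Properties as the configuration source are all hypothetical and are not taken from the Nutch code base.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

/** Hypothetical sketch only: this is NOT Nutch's actual plugin factory. */
public class ContentTypeRegistry {
    // Maps a normalized content type to the id of the parser plugin that accepts it.
    private final Map<String, String> parserByType = new HashMap<>();

    /** Loads entries of the form "content-type = plugin-id" from configuration. */
    public ContentTypeRegistry(Properties config) {
        for (String type : config.stringPropertyNames()) {
            parserByType.put(normalize(type), config.getProperty(type));
        }
    }

    /** Drops parameters such as "; charset=utf-8" and lowercases the type. */
    static String normalize(String contentType) {
        int semicolon = contentType.indexOf(';');
        String bare = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return bare.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the configured plugin id, or empty if no plugin accepts this type. */
    public Optional<String> parserFor(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(parserByType.get(normalize(contentType)));
    }

    public static void main(String[] args) {
        // In practice these entries would come from a config file, not from code.
        Properties config = new Properties();
        config.setProperty("application/rss", "parse-rss"); // hypothetical plugin id
        config.setProperty("text/rss", "parse-rss");
        ContentTypeRegistry registry = new ContentTypeRegistry(config);

        System.out.println(registry.parserFor("Text/RSS; charset=utf-8"));
        System.out.println(registry.parserFor("image/png"));
    }
}
```

With a lookup like this, accepting another content type for RSS would mean adding one configuration entry, and the plugins themselves would no longer need their own content type check or a recompile.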

That's consistent with the other discussion on this point:
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/
