nutch-dev mailing list archives

From: Chris Mattmann <chris.mattm...@jpl.nasa.gov>
Subject: Re: RSS Parser Bug!?
Date: Thu, 08 Sep 2005 15:39:48 GMT

Hi Jack,

 Wow, that's a weird error. I'm not sure what's causing it, so let me look at
the stack trace you provided and get back to you on that. As for your second
question:

> My question is that can parse-rss support "application/xml" or more
> content-type?

The answer to that is a resounding "yes". Parse-rss, being based on the
commons-feedparser, can support the following (taken from the feedparser
site, http://jakarta.apache.org/commons/sandbox/feedparser/):

"Jakarta FeedParser is a Java RSS/Atom parser designed to elegantly support
all versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0), Atom 0.5 (and future
versions) as well as easy ad hoc extension and RSS 1.0 modules capability."

In my experience using the feedparser, I have found this to be true as well.
The mimeType (what you list as "application/xml" above) is just what the web
server returns to describe the content type of the RSS file. Parse-rss's
support for RSS is really outside the scope of the mimeType: it depends on the
schema of the feed being used and, as stated above, parse-rss can support
several different feed schemas (e.g., RSS 1.x, 2.x, Atom).
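
To make the distinction concrete, here is a minimal sketch (an illustration
only, not code from Nutch or parse-rss) of treating the Content-Type as a
configurable hint rather than a hard-coded test. The class name
AcceptedContentTypes and the comma-separated list passed to it are assumptions
made for this example. It strips parameters such as "; charset=ISO-8859-1"
before comparing, and it lets a missing Content-Type through, as the null
check in the plugin code quoted below does.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of Nutch: decides whether a Content-Type is acceptable. */
final class AcceptedContentTypes {
    private final Set<String> accepted = new HashSet<>();

    /** @param types comma-separated media types, e.g. read from a configuration property. */
    AcceptedContentTypes(String types) {
        for (String t : types.split(",")) {
            String normalized = normalize(t);
            if (!normalized.isEmpty()) {
                accepted.add(normalized);
            }
        }
    }

    /** Drops parameters such as "; charset=ISO-8859-1" and lower-cases the media type. */
    static String normalize(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** A missing Content-Type is accepted; otherwise the media type must be in the list. */
    boolean accepts(String contentType) {
        return contentType == null || accepted.contains(normalize(contentType));
    }
}
```

With the list "text/xml, application/rss+xml, application/xml", a header of
"application/xml; charset=ISO-8859-1" is accepted and "text/html" is rejected.
Note that this sketch matches the media type exactly, whereas the quoted
plugin code uses startsWith.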

Looking at the stack trace that you provided below, I'm not even sure that
it's making it to the point where the parse-rss plugin is getting called.
However, I'll have a look and see if I can figure out what is causing your
stack trace. Stay tuned...


Cheers,
  Chris




On 9/8/05 8:19 AM, "Jack Tang" <himars@gmail.com> wrote:

> Hi Chris
> 
> Thanks for your explanation.
> I want to let the "application/xml" content type go into the parse-rss
> plugin, so I added the statement
> 
>         if (contentType != null
>                 && !contentType.startsWith("text/xml")
>                 && !contentType.startsWith("application/rss+xml")
>                 && !contentType.startsWith("application/xml"))
>             return new ParseStatus(ParseStatus.FAILED_INVALID_FORMAT,
>                     "Content-Type not text/xml, application/xml or application/rss+xml: "
>                             + contentType).getEmptyParse();
> 
> 
>  But unfortunately, it failed again.  Here is the error message:
> ------------------------------------------------------------------------------
> -------------------------------------------
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] -
> http.proxy.host = null
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] -
> http.proxy.port = 8080
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] -
> http.timeout = 10000
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] -
> http.content.limit = 65536
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] -
> http.agent = NutchCVS/0.06-dev (Nutch;
> http://www.nutch.org/docs/en/bot.html;
> nutch-agent@lists.sourceforge.net)
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] -
> http.auth.ntlm.username =
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] -
> fetcher.server.delay = 1000
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] -
> http.max.delays = 100
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - Configured
> Client
> 050908 231023 org.apache.nutch.fetcher.Fetcher$FetcherThread [11] -
> SEVERE error writing output:java.lang.NullPointerException
> java.lang.NullPointerException
> at org.apache.nutch.io.UTF8.writeString(UTF8.java:236)
> at org.apache.nutch.parse.Outlink.write(Outlink.java:51)
> at org.apache.nutch.parse.ParseData.write(ParseData.java:111)
> at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
> at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
> at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:262)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
> Exception in thread "main" java.lang.RuntimeException: SEVERE error
> logged.  Exiting fetcher.
> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
> at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
> at net.recruit.fetch.JobCrawlTool.main(JobCrawlTool.java:150)
> 
> Could it be a conflict between plugins?
> My question is that can parse-rss support "application/xml" or more
> content-type?
> 
> Thanks
> /Jack
> 
> On 9/8/05, CHRIS A MATTMANN <Chris.A.Mattmann@jpl.nasa.gov> wrote:
>> Hi Jack,
>> 
>>   I'm not necessarily sure that this is a "bug" per se: it's just the fact
>> that several different content types are potentially possible when any ol'
>> webserver returns an RSS file. To be honest, I performed a pretty detailed
>> crawl (100s of thousands of pages) when I originally wrote the plugin way
>> back in March/April of this year, and the two content types that you see in
>> the code right now that it checks for are what I found to be the most
>> pervasive content type returned from webservers for RSS. However, in no way
>> did I mean for that list to be exhaustive: for instance, web servers may also
>> return "application/rss", or "text/rss", or even "text/plain" I have seen for
>> RSS. It all depends on how the webmaster has configured the web server. So
>> it's kind of difficult to accurately and reliably discriminate against the
>> content type within a parser plugin itself, because it is inherently out of
>> the parsers hands what gets returned for a particular type of file, and even
>> though there are some "best practices" for what should be returned for
>> different file types, there are by no means any "standards" that must be
>> followed.
>> 
>> So, I would propose the following. I believe the checking for the content
>> type and then throwing an exception block of code exists in other plugins in
>> Nutch as well. I propose we nix that, and remove the content type checking
>> and exception message from the plugins themselves, and move it up to a higher
>> level, i.e., the actual plugin factory or something. Let it get taken care
>> of there, and let it be configurable, out of the code of each plugin for
>> instance. Because that way, I believe you can customize whatever plugin to do
>> whatever your need is, * without * having to recompile the code just to add
>> another accepted content type to a plugin so it doesn't throw an error
>> message.
>> 
>> What say you guys? :-)
>> 
>> Cheers,
>>   Chris
>> 
>> 
>> ----- Original Message -----
>> From: Jack Tang <himars@gmail.com>
>> Date: Wednesday, September 7, 2005 10:58 pm
>> Subject: RSS Parser Bug!?
>> 
>>> Hi Guys
>>> 
>>> Did someone install parse-rss and try to fetch rss feeds?
>>> It failed on my side. I enabled the plugin and the feed was fetched, but
>>> the rss parser did not work.
>>> My feed is http://www.craigslist.org/evs/index.rss
>>> 
>>> Here is the error:
>>> 
>>> org.apache.nutch.fetcher.Fetcher$FetcherThread [11] - fetch okay, but
>>> can't parse http://beijing.craigslist.org/jjj/index.rss, reason:
>>> failed(2,203): Content-Type not text/html: application/xml;
>>> charset=ISO-8859-1
>>> 
>>> The content-type is application/xml. Mattmann's comment is this:
>>>        // check that contentType is one we can handle
>>>        String contentType = content.getContentType();
>>>        if (contentType != null
>>>                && !contentType.startsWith("text/xml")
>>>                && !contentType.startsWith("application/rss+xml"))
>>>            return new ParseStatus(ParseStatus.FAILED_INVALID_FORMAT,
>>>                    "Content-Type not text/xml or application/rss+xml: "
>>>                            + contentType).getEmptyParse();
>>> 
>>> So, it does not support the "application/xml" content type yet?
>>> 
>>> 
>>> Thanks
>>> /Jack
>>> --
>>> Keep Discovering ... ...
>>> http://www.jroller.com/page/jmars
>>> 
>> 
>> 
> 

______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
 
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
Phone:  818-354-8810
_______________________________________________________
 
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
 
 



