nutch-dev mailing list archives

From CHRIS A MATTMANN <Chris.A.Mattm...@jpl.nasa.gov>
Subject Re: Huge Problem trying to develop plugin for Nutch
Date Sat, 26 Mar 2005 21:32:34 GMT
Hi John,

  I posted it earlier as a .txt file, but since it's small I could just include it in this
email:


import java.net.URL;
import java.net.URLClassLoader;


public class test2 {

    public test2() {}

    public static void main(String[] args) throws Exception {
        test2 t = new test2();

        // Jars and class directories to put on the dynamic class path.
        URL[] theURLs = new URL[7];
        theURLs[0] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/parse-rss.jar");
        theURLs[1] = new URL("file:/home/chris/cs599-search-engines/nutch/build/classes/");
        theURLs[2] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/commons-feedparser-0.5-beta.jar");
        theURLs[3] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/log4j-1.2.6.jar");
        theURLs[4] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/commons-httpclient-3.0-beta1.jar");
        theURLs[5] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/saxpath.jar");
        theURLs[6] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/jaxen-full.jar");
        //theURLs[5] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/jdom.jar");

        System.out.println(t.getClass().getName());

        // Child loader whose parent is the loader that loaded test2.
        URLClassLoader theLoader = new URLClassLoader(theURLs, t.getClass().getClassLoader());
        //theLoader.loadClass("org.jdom.Document");

        // Load the plugin's parser reflectively and call its no-argument testMain method.
        Class<?> c = theLoader.loadClass("net.nutch.parse.rss.RSSParser");
        Object o = c.getConstructors()[0].newInstance();
        c.getMethod("testMain").invoke(o); //this works fine!

        //org.jdom.Document d = new org.jdom.Document();
    }

}

There it is. Of course, to run it, you will need the jar files that I am dynamically
loading, and the paths to those jar files must be valid on whatever system you run it on.
You will also need the parse-rss.jar file, which contains the RSS parser plugin that I'm
currently working on for Nutch. You can get that from the following URL:
http://baron.pagemewhen.com:8080/~chris/parse-rss.zip
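
Since the hard-coded paths are the part most likely to break on another machine, here is a minimal sketch of collecting the jar URLs from a directory instead. The class name JarUrls and its method are hypothetical helpers, not part of Nutch or of test2.java. Note that it picks up every *.jar file in the directory (so it would include jdom.jar as well) and that it does not add the build/classes directory, which would still have to be appended separately.

```java
import java.io.IOException;
import java.net.URL;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper (an assumption for illustration, not part of Nutch). */
final class JarUrls {

    /** Returns a URL for every *.jar file directly inside dir, sorted by path. */
    static URL[] fromDirectory(Path dir) throws IOException {
        List<Path> jars = new ArrayList<>();
        // The glob is matched against the file name of each directory entry.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : stream) {
                jars.add(jar);
            }
        }
        // Sort so the class path order is the same on every run.
        Collections.sort(jars);

        URL[] urls = new URL[jars.size()];
        for (int i = 0; i < urls.length; i++) {
            urls[i] = jars.get(i).toUri().toURL();
        }
        return urls;
    }
}
```

The resulting array could then be passed to the URLClassLoader constructor in place of the hand-filled theURLs array.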

Then just compile it and it should work. Basically, the issue that I'm having is that
I can get the feedparser working fine in its own standalone program. I can run it from the
command line, or from another standalone program as I'm demonstrating above. However, when
I try to use the feedparser as part of my parse-rss plugin, it chokes when the Parse
getParse method is called, because that method involves calls to the feedparser. The error
message that I'm receiving (as shown in the original crawl log file that I posted to the
group) says something to the effect of: ClassNotFoundException: org/jdom/Document, which is
a class that the feedparser depends on. The weird thing, however, is that if you look at my
log file, the PluginRepository read the necessary jar file (jdom.jar) from the plugin.xml
file (within the library sub-element of runtime), from the correct path on my system (I
checked that it's the same path that works in my test2.java program above). I even put
println's and LOG.info messages everywhere to ensure that the PluginClassLoader loads all
the necessary jar URLs (from its constructor) when the getClassLoader method is called in
the PluginRepository getPluginInstance method. I see it loading the jdom.jar file. However,
when the feedparser gets called from my plugin while Nutch is trying to getContent from an
RSS file, the feedparser chokes because it can't find the org/jdom/Document class.
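
One way to narrow down this kind of failure is to check, at the point where the feedparser is invoked, which class loaders are actually in play and whether each of them can see org.jdom.Document. The sketch below is only a diagnostic aid under stated assumptions: ClassLoaderProbe is a hypothetical class, not part of Nutch or the feedparser, and it does not claim to identify the cause of the error described above. It compares the loader that defined a class with the thread context class loader, since the two can differ when code is loaded through a plugin system.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper (an assumption for illustration, not part of Nutch). */
final class ClassLoaderProbe {

    /** Returns the given loader followed by all of its parents, child first. */
    static List<ClassLoader> chain(ClassLoader start) {
        List<ClassLoader> loaders = new ArrayList<>();
        for (ClassLoader cl = start; cl != null; cl = cl.getParent()) {
            loaders.add(cl);
        }
        return loaders;
    }

    /** True if the named class can be found through the given loader (without initializing it). */
    static boolean canLoad(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Prints each loader in the chain, plus its URLs when it is a URLClassLoader. */
    static void dump(String label, ClassLoader start) {
        System.out.println(label + ":");
        for (ClassLoader cl : chain(start)) {
            System.out.println("  " + cl);
            if (cl instanceof URLClassLoader) {
                for (URL url : ((URLClassLoader) cl).getURLs()) {
                    System.out.println("    " + url);
                }
            }
        }
    }

    public static void main(String[] args) {
        // The class name to look for; defaults to the one from the error message.
        String target = args.length > 0 ? args[0] : "org.jdom.Document";
        ClassLoader defining = ClassLoaderProbe.class.getClassLoader();
        ClassLoader context = Thread.currentThread().getContextClassLoader();

        dump("defining loader", defining);
        dump("context loader", context);
        System.out.println(target + " via defining loader: " + canLoad(target, defining));
        System.out.println(target + " via context loader:  " + canLoad(target, context));
    }
}
```

Calling dump and canLoad from inside the plugin, just before the feedparser call, with the plugin class's own loader and the context loader, would show whether jdom.jar is visible to the loader that is actually resolving the class.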

That's basically my problem in a nutshell. Sorry for the long-windedness (is that a word?
:-) ), but I just wanted to be as thorough as I could real fast when explaining the extent
to which I've investigated this problem that I'm having.

Any help anyone could provide would be much appreciated.

Thanks,
 Chris


----- Original Message -----
From: John X <john@neasys.com>
Date: Saturday, March 26, 2005 2:00 pm
Subject: Re: Huge Problem trying to develop plugin for Nutch

> On Sat, Mar 26, 2005 at 01:13:33PM -0800, CHRIS A MATTMANN wrote:
> > Hi John,
> > 
> >  Thanks for your reply. Actually I already have the feedparser
> > working from the command line. I also included a program,
> > test2.java with my original email that shows how I can dynamically
> > load the class and call the feedparser method. So, I actually
> > already have that tool.
> 
> Could you post your working command line tool?
> 
> John
> 
> > 
> > Any help on this issue would be greatly appreciated.
> > 
> > Thanks,
> >   Chris
> > 
> > 
> > ----- Original Message -----
> > From: John X <john@neasys.com>
> > Date: Saturday, March 26, 2005 1:07 am
> > Subject: Re: Huge Problem trying to develop plugin for Nutch
> > 
> > > Why try it the hard way? You may want to
> > > create a simple tool, just calling feedparser to parse your
> > > hi.rss? Have that work first, then worry about dynamic loading
> > > and nutch plugin system.
> > > Let us know when you have the simple tool.
> > > 
> > > John
> > > 
> > > On Fri, Mar 25, 2005 at 06:08:50PM -0800, Chris Mattmann wrote:
> > > > Hi Folks,
> > > > 
> > > >  
> > > > 
> > > >  My name is Chris Mattmann: I work at the Jet Propulsion Laboratory in
> > > > Pasadena, CA, U.S.A. I'm new to the list. Nice to meet you all.
> > > > 
> > > > I am having some * major * trouble trying to build an RSS content parser
> > > > plugin for nutch. My plugin is based on the parse-pdf plugin structure and
> > > > uses the apache commons-feedparser library out of the Jakarta sandbox to try
> > > > and parse rss feeds and send them to nutch for indexing. The problem that I
> > > > am having is * very * strange. Basically after about 2 days of going around
> > > > the Nutch source code I've tracked my problem down to basically the fact
> > > > that for whatever reason, the jdom.jar library the commons-feedparser relies
> > > > on, is not accessible via the Nutch Plugin runtime. I keep getting the same
> > > > error whenever I run the crawler to crawl Rss pages. I've set up a dummy web
> > > > page with a single link to an rss file. Here's the webpage:
> > > > 
> > > >  
> > > > 
> > > 
> > 
> > 
> __________________________________________
> http://www.neasys.com - A Good Place to Be
> Come to visit us today!
> 

