nutch-dev mailing list archives

From "Chris Mattmann" <Chris.Mattm...@jpl.nasa.gov>
Subject RE: Huge Problem trying to develop plugin for Nutch
Date Sat, 26 Mar 2005 03:49:39 GMT
Hi,

 

 For whatever reason (maybe file filtering) I think that my test2.java file
that I attached didn't go through. So, I renamed the extension to .txt.
Let's see if it goes through this time.

 

Sorry about having to send another email. Thanks very much for any help!

 

Cheers,

  Chris

 

 

  _____  

From: Chris Mattmann [mailto:chris.mattmann@jpl.nasa.gov] 
Sent: Friday, March 25, 2005 6:09 PM
To: nutch-dev@incubator.apache.org
Cc: 'Ellis Horowitz'; paul.ramirez@jpl.nasa.gov;
chris.mattmann@jpl.nasa.gov; 'Rami A. Al-Ghanmi'; mattmann@usc.edu
Subject: Huge Problem trying to develop plugin for Nutch

 

Hi Folks,

 

 My name is Chris Mattmann: I work at the Jet Propulsion Laboratory in
Pasadena, CA, U.S.A. I'm new to the list. Nice to meet you all.

 

I am having some * major * trouble trying to build an RSS content parser
plugin for Nutch. My plugin is based on the parse-pdf plugin structure and
uses the Apache commons-feedparser library out of the Jakarta sandbox to
parse RSS feeds and send them to Nutch for indexing. The problem that I am
having is * very * strange. After about 2 days of going through the Nutch
source code, I've tracked my problem down to the fact that, for whatever
reason, the jdom.jar library that commons-feedparser relies on is not
accessible via the Nutch plugin runtime. I keep getting the same error
whenever I run the crawler to crawl RSS pages. I've set up a dummy web page
with a single link to an RSS file. Here's the webpage:

 

http://baron.pagemewhen.com:8080/~chris/hi.rss

 

 

So, I seed my crawler with the baron.pagemewhen.com:8080/~chris/ webpage,
and then tell it to go get the content and start parsing via the
./bin/nutch crawl command. While it's crawling, I get the attached output in
the nutch-crawl-log.txt file (along with print lines that I've inserted into
the Nutch source code myself so I can see what's happening: these are
denoted by the (&(&(& CHRIS variants). I've gone round and round in the
PluginRepository/PluginDescriptor classes in the net.nutch.plugin package,
and I now pretty much fully understand how everything is working and how
the PluginClassLoader is loading the classes. You can even see in my log
file that it got all the correct classes in my classpath. The files are
located in the right directory: I have verified all the classpath URLs to
the jar files that it captured. Further, I wrote a test2.java program that
simulates dynamically loading the RSS parser class using a URLClassLoader,
and for whatever strange reason, that same code works! Just not in Nutch.
I've attached this program as well for your convenience (the test2.java
attachment). I've also attached my RSS plugin, along with dependent jar
files in the plugin structure, along with my plugin.xml file. The plugin is
located here in a zip file:
http://baron.pagemewhen.com:8080/~chris/parse-rss.zip . Can someone please
give me some idea as to what I'm doing wrong here? I am so frustrated I'm
pulling my hair out :-(
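
For readers without the attachment, the kind of standalone check described
above can be sketched roughly as follows. This is not the attached
test2.java; the class name PluginLoadSketch and the method canLoad are
hypothetical, and the jar paths and class name are whatever the caller
passes in. The sketch uses its own class loader as the parent of the
URLClassLoader and makes no assumption about how Nutch's plugin loader is
wired.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;

public class PluginLoadSketch {

  /**
   * Returns true if className, including the classes it needs during
   * static initialization, can be loaded from the given jars/directories.
   */
  static boolean canLoad(String className, String... paths) {
    URL[] urls = new URL[paths.length];
    try {
      for (int i = 0; i < paths.length; i++) {
        urls[i] = new File(paths[i]).toURI().toURL();
      }
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException(e);
    }
    // Assumption: the parent is this class's own loader; lookups are
    // delegated to it before the URLs above are searched.
    try (URLClassLoader loader =
        new URLClassLoader(urls, PluginLoadSketch.class.getClassLoader())) {
      Class.forName(className, true, loader);
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      // LinkageError covers NoClassDefFoundError for a missing dependency.
      System.err.println("could not load " + className + ": " + e);
      return false;
    } catch (IOException e) {
      // Only thrown when closing the loader fails.
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    if (args.length < 1) {
      System.out.println("usage: PluginLoadSketch <class> [jar-or-dir ...]");
      return;
    }
    String[] paths = new String[args.length - 1];
    System.arraycopy(args, 1, paths, 0, paths.length);
    System.out.println(args[0] + " loadable: " + canLoad(args[0], paths));
  }
}
```

Running it with the parser class name followed by every jar in the plugin
directory, and then again with one jar left out, is one way to see which
dependency a given class actually needs at load time.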

 

 

Further, while perusing the PluginManifestParser class looking for a
solution to my problem, I believe that I have found a bug. First off, I
wanted to let you know that I'm working with the 0.6 version of Nutch.
Inside the PluginManifestParser, I found the place where it loads the
libraries. When it looks for the "export" sub-element in the library
element within the plugin.xml file, there is a typo that causes it to not
function correctly on exported libraries. The typo, shown here with our fix
applied, is in the following method:

 

  /**
   * @param pRootElement root element of the parsed plugin.xml
   * @param pDescriptor descriptor that receives the library paths
   */
  private static void parseLibraries(Element pRootElement,
                                     PluginDescriptor pDescriptor)
      throws MalformedURLException {
    Element runtime = pRootElement.element("runtime");
    if (runtime == null)
      return;
    List libraries = runtime.elements("library");
    for (int i = 0; i < libraries.size(); i++) {
      Element library = (Element) libraries.get(i);
      String libName = library.attributeValue("name");

      //@Bug Fix
      //By: Chris Mattmann and Paul Ramirez
      //The element name below used to read "extport"
      Element exportElement = library.element("export");
      if (exportElement != null)
        pDescriptor.addExportedLibRelative(libName);
      else
        pDescriptor.addNotExportedLibRelative(libName);
    }
  }

 

So, the XML child element name that it was looking for was misspelled.
Since I don't know how to commit to the Nutch source tree (or if it's even
allowed), I just wanted to pass this bug fix along to you guys (I think
it's a bug; correct me if I'm wrong). I'm also very interested in becoming
a committer/helping out on the project. I think it's really cool.
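
To illustrate why a misspelled child element name matters, here is a small
standalone sketch. It uses the JDK's built-in DOM parser rather than the
XML library used in the Nutch code above, and the class name
ExportLookupSketch, the method countExported, and the manifest fragment are
all made up for illustration. The point is only that looking up a name that
never appears in the file finds nothing, so every library falls into the
"not exported" branch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ExportLookupSketch {

  /** Counts library elements that contain an element named childName. */
  static int countExported(String xml, String childName) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(
              xml.getBytes(StandardCharsets.UTF_8)));
      NodeList libraries = doc.getElementsByTagName("library");
      int exported = 0;
      for (int i = 0; i < libraries.getLength(); i++) {
        Element library = (Element) libraries.item(i);
        // A name that is absent from the document yields an empty list.
        if (library.getElementsByTagName(childName).getLength() > 0) {
          exported++;
        }
      }
      return exported;
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse manifest", e);
    }
  }

  public static void main(String[] args) {
    // Hypothetical manifest fragment, for illustration only.
    String xml = "<plugin id=\"parse-rss\"><runtime>"
        + "<library name=\"parse-rss.jar\"><export name=\"*\"/></library>"
        + "<library name=\"jdom.jar\"/>"
        + "</runtime></plugin>";
    System.out.println("export:  " + countExported(xml, "export"));
    System.out.println("extport: " + countExported(xml, "extport"));
  }
}
```

With the correct name the first library is counted as exported; with the
misspelled name none are.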

 

 

So, yeah if you guys could help me with my plugin problem, it would be much
appreciated. I'm doing this RSS plugin as part of my Cs599: Seminar on
Search Engines Ph.D. course at the University of Southern California.

 

Thanks a lot.

 

Cheers,

  Chris

 

 

 

 

 

 

______________________________________________

Chris A. Mattmann

Chris.Mattmann@jpl.nasa.gov 

Staff Member

Modeling and Data Management Systems Section (387)

Data Management Systems and Technologies Group

 

_________________________________________________

Jet Propulsion Laboratory            Pasadena, CA

Office: 171-266B                        Mailstop:  171-246

Phone:  818-354-8810

_______________________________________________________

 

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

 

 

> -----Original Message-----

> From: John X [mailto:john@neasys.com]

> Sent: Friday, March 25, 2005 6:24 PM

> To: nutch-dev@incubator.apache.org; Jérôme Charron

> Cc: john@neasys.com

> Subject: Re: Mime/Magic mapper

> 

> On Sat, Mar 26, 2005 at 01:48:05AM +0100, Jérôme Charron wrote:

> > Does somebody know why John Xing deactivate the mime.magic.file

> > support in protocol-file plugin?

> 

> The "disabled" are only hooks to use mimetype/magic mapper.

> The mapper I used in a project had license issue (can't be redistributed).

> There is no mapper code in nutch. That's about one year ago.

> If you know one without license issue, I will be happy to add it in.

> 

> John

> 

> > I'm writing an mbox-parser plugin, and typically, an mbox has no

> > extension => it's mime type could not be determined using

> > extension/mime-type mapper.

> > For an mbox, the mime-type can only be defined by "analyzing" the file

> > content (using a mime-type/magic mapper).

> >

> > Thanks

> >

> > Jérôme

> >

> >

> >

> > --

> > http://motrech.free.fr/ - motrech [home]

> > http://motrech.blogspot.com/ - motrech [blog]

> > http://fr.groups.yahoo.com/group/motrech - motrech [liste]

> > http://fr.groups.yahoo.com/group/frutch - frutch [liste]

> >

> __________________________________________

> http://www.neasys.com - A Good Place to Be

> Come to visit us today!

