nutch-dev mailing list archives

From "John Sherwood (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-826) Mailing list is broken.
Date Mon, 24 May 2010 06:29:23 GMT
Mailing list is broken.
-----------------------

                 Key: NUTCH-826
                 URL: https://issues.apache.org/jira/browse/NUTCH-826
             Project: Nutch
          Issue Type: Bug
            Reporter: John Sherwood
            Priority: Blocker


All of the following addresses are failing:

nutch-user@nutch.apache.org
nutch-user-subscribe@nutch.apache.org
nutch-user-subscribe@lucene.apache.org

For the last one, the mailer daemon said 
"This mailing list has moved to user at nutch.apache.org."

Below is the message I tried to send:

Hi people,

I've been banging my head against this problem for two days now.
Simply, I want to add a field with the value of a given meta tag.

I've been trying the parse-xml plugin, but it doesn't seem to
work with version 1.0.  I've tried the code at
http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html
and it hasn't worked.  I don't even know why.  I don't even know if my
plugin is being used... or even looked for!  Nutch seems to have an
infuriating "Fail silently" policy for plugins.  I put a
System.exit(1) in my filters just to see if my code is even being
reached.  It is not, in spite of my config telling Nutch to load it.

Here's my config:
nutch-site.xml
...
<property>
 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|metadata</value>
</property>
...

parse-plugins.xml
...
<mimeType name="application/xhtml+xml">
  <plugin id="parse-html" />
  <plugin id="metadata" />
</mimeType>

<mimeType name="text/html">
  <plugin id="parse-html" />
  <plugin id="metadata" />
</mimeType>

<mimeType name="text/sgml">
  <plugin id="parse-html" />
  <plugin id="metadata" />
</mimeType>

<mimeType name="text/xml">
  <plugin id="parse-html" />
  <plugin id="parse-rss" />
  <plugin id="metadata" />
  <plugin id="feed" />
</mimeType>
...
<alias name="metadata"
extension-id="com.example.website.nutch.parsing.MetaTagExtractorParseFilter"
/>
...

I've also copied the plugin.xml and jar from my build/metadata to the
plugins root dir.
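
For reference, a descriptor for a plugin of this kind usually has roughly
the shape sketched below.  This is not the actual file from build/metadata:
only the id "metadata" and the class name are taken from the config above.
The jar name, version, provider, and display names are assumptions, as is
the guess that the class implements the HtmlParseFilter extension point.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical descriptor: only the id "metadata" and the class name
     come from the config above; everything else is assumed. -->
<plugin id="metadata"
        name="Meta Tag Extractor"
        version="1.0.0"
        provider-name="example.com">

  <runtime>
    <!-- Assumed jar name; it has to match the jar shipped with this file. -->
    <library name="metadata.jar">
      <export name="*"/>
    </library>
  </runtime>

  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>

  <!-- Assumes the class implements org.apache.nutch.parse.HtmlParseFilter. -->
  <extension id="com.example.website.nutch.parsing.MetaTagExtractorParseFilter"
             name="Meta Tag Extractor Parse Filter"
             point="org.apache.nutch.parse.HtmlParseFilter">
    <implementation id="MetaTagExtractorParseFilter"
                    class="com.example.website.nutch.parsing.MetaTagExtractorParseFilter"/>
  </extension>
</plugin>
```

The plugin id in the descriptor is the name that plugin.includes is matched
against.  The stock plugins each sit in their own subdirectory of the plugins
folder (for example plugins/parse-html/), next to their jar.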

Nonetheless, Nutch runs and puts data in Solr for me.  AFAIK, Nutch is
completely unaware of my plugin despite my config options.  Is there
some other place I need to tell Nutch to use my plugin?  Is there some
other way to do this without having to write a plugin?  This does
seem like a lot of work simply to get a meta tag into a field.  Any
help would be appreciated.

Sincerely,

John Sherwood

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

