nutch-dev mailing list archives

From "eyal edri" <eyal.e...@gmail.com>
Subject writing a new parse-exe plugin
Date Wed, 17 Oct 2007 13:53:54 GMT
Hi all,

I'm trying to write a new plugin that will download files with content type
application/x-dosexec (EXE files).
I followed the "write your own plugin" tutorial on the wiki and took the
following steps (some of them are not mentioned in the tutorial):

   1. Created a new directory, $NUTCH_HOME/src/plugins/parse-exe
   2. Created a new $NUTCH_HOME/src/plugins/parse-exe/plugin.xml [displayed
   below]
   3. Created a new $NUTCH_HOME/src/plugins/parse-exe/build.xml [displayed
   below]
   4. Wrote the Java code in
   $NUTCH_HOME/src/plugin/parse-exe/src/java/org/apache/nutch/parse/exe/ExeParser.java
   5. Added "nutch-extensionpoints" and "parse-exe" to the plugin.includes
   property in $NUTCH_HOME/conf/nutch-site.xml
   6. Added code to $NUTCH_HOME/conf/parse-plugins.xml [written below]
   7. Added code to $NUTCH_HOME/src/plugins/build.xml [written
   below]
   8. Copied $NUTCH_HOME/build/plugins/parse-exe/parse-exe.jar to
   $NUTCH_HOME/plugins/parse-exe
   9. Ran ant (build successful)

The log shows that Nutch recognizes the plugin:

2007-10-17 15:15:55,657 INFO  plugin.PluginRepository - Registered Plugins:
2007-10-17 15:15:55,657 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2007-10-17 15:15:55,657 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2007-10-17 15:15:55,657 INFO  plugin.PluginRepository -         Exe Parse Plug-in (parse-exe)

But when the fetcher encounters an application/x-dosexec file, it throws an exception:

2007-10-17 15:17:16,146 WARN  parse.ParseUtil - No suitable parser found when trying to parse content http://www.foo.com/yyy/foo.exe of type application/x-dosexec
2007-10-17 15:17:16,146 WARN  fetcher.Fetcher - Error parsing: http://www.foo.com/yyy/foo.exe: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=application/x-dosexec url=http://www.foo.com/yyy/movie30.exe

(Sorry, the URL has been masked for security reasons.)

Am I missing something?

Thanks!



[$NUTCH_HOME/src/plugins/build.xml]

<ant dir="parse-exe" target="deploy"/>

[parse-plugins.xml]

  <mimeType name="application/x-dosexec">
    <plugin id="parse-exe" />
  </mimeType>
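
To double-check which plugin ids a mapping file like the one above assigns to each MIME type, the entries can be read back with the JDK's DOM parser. This is only a minimal sketch: the class MimeTypeMappings and its read method are hypothetical names of mine, not part of Nutch, and it assumes the mimeType elements sit under a single root element such as parse-plugins.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not a Nutch class): lists the plugin ids declared per MIME type.
final class MimeTypeMappings {

  /** Returns MIME type name -> plugin ids, in document order. */
  static Map<String, List<String>> read(String parsePluginsXml) {
    try {
      Element root = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(parsePluginsXml.getBytes(StandardCharsets.UTF_8)))
          .getDocumentElement();
      Map<String, List<String>> result = new LinkedHashMap<>();
      // Assumption: every <mimeType> is a descendant of the root element.
      NodeList mimeTypes = root.getElementsByTagName("mimeType");
      for (int i = 0; i < mimeTypes.getLength(); i++) {
        Element mimeType = (Element) mimeTypes.item(i);
        List<String> ids = new ArrayList<>();
        NodeList plugins = mimeType.getElementsByTagName("plugin");
        for (int j = 0; j < plugins.getLength(); j++) {
          ids.add(((Element) plugins.item(j)).getAttribute("id"));
        }
        result.put(mimeType.getAttribute("name"), ids);
      }
      return result;
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse mapping file", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<parse-plugins>"
        + "<mimeType name=\"application/x-dosexec\"><plugin id=\"parse-exe\"/></mimeType>"
        + "</parse-plugins>";
    System.out.println(read(xml));
  }
}
```

This only confirms what the file declares; it says nothing about how Nutch resolves a plugin id to a parser extension at run time.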


[plugin.xml] // copied and changed from parse-pdf

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="parse-exe"
   name="Exe Parse Plug-in"
   version="1.0.0"
   provider-name="nutch.org">

   <runtime>
      <library name="parse-exe.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
      <import plugin="lib-log4j"/>
   </requires>

   <extension id="org.apache.nutch.parse.exe"
              name="ExeParse"
              point="org.apache.nutch.parse.Parser">

      <implementation id="org.apache.nutch.parse.exe.ExeParse"
                      class="org.apache.nutch.parse.exe.ExeParse">
        <parameter name="contentType" value="application/x-dosexec"/>
        <parameter name="pathSuffix"  value=""/>
      </implementation>
   </extension>

</plugin>
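
One thing that can be checked mechanically is whether the class attribute in plugin.xml names the same class as the compiled Java source. A minimal sketch using only the JDK's DOM parser follows; PluginDescriptorCheck and its methods are hypothetical names, not part of Nutch, and I have not confirmed that such a check explains the warning in the log.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not a Nutch class): compares a descriptor with an expected class name.
final class PluginDescriptorCheck {

  /** Returns the class attribute of every <implementation> element in the descriptor. */
  static List<String> implementationClasses(String pluginXml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(pluginXml.getBytes(StandardCharsets.UTF_8)));
      NodeList nodes = doc.getElementsByTagName("implementation");
      List<String> classes = new ArrayList<>();
      for (int i = 0; i < nodes.getLength(); i++) {
        classes.add(((Element) nodes.item(i)).getAttribute("class"));
      }
      return classes;
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse plugin descriptor", e);
    }
  }

  /** True when the descriptor declares exactly the given fully qualified class name. */
  static boolean declares(String pluginXml, String className) {
    return implementationClasses(pluginXml).contains(className);
  }

  public static void main(String[] args) {
    // Trimmed copy of the descriptor; pass the name of the class that was actually compiled.
    String xml = "<plugin id=\"parse-exe\">"
        + "<extension id=\"org.apache.nutch.parse.exe\" point=\"org.apache.nutch.parse.Parser\">"
        + "<implementation id=\"org.apache.nutch.parse.exe.ExeParse\""
        + " class=\"org.apache.nutch.parse.exe.ExeParse\"/>"
        + "</extension></plugin>";
    System.out.println(declares(xml, "org.apache.nutch.parse.exe.ExeParser"));
  }
}
```

Run with the descriptor text and the fully qualified name of the class in ExeParser.java, it reports whether the two agree.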

-----------------------------------------------------------------------------------------------------------------

[build.xml]

<?xml version="1.0"?>

<project name="parse-exe" default="jar-core">

  <import file="../build-plugin.xml"/>

</project>

------------------------------------------------------------------------
[ExeParser.java]

package org.apache.nutch.parse.exe;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.protocols.Response;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.LogUtil;

public class ExeParser implements Parser {
  public static final Log LOG = LogFactory.getLog("org.apache.nutch.parse.exe");
  private Configuration conf;

  public Parse getParse(Content content) {

    try {

      byte[] raw = content.getContent();

      // placeholder: I will replace this with real code
      LOG.info("EDRI:: you have reached the parse-exe plugin!");
      System.out.println("EDRI:: system.out.print... parse-exe");

      // fail if the number of bytes fetched differs from the Content-Length header
      String contentLength = content.getMetadata().get(Response.CONTENT_LENGTH);
      if (contentLength != null && raw.length != Integer.parseInt(contentLength)) {
        return new ParseStatus(ParseStatus.FAILED, ParseStatus.FAILED_TRUNCATED,
            "Content truncated at " + raw.length
            + " bytes. Parser can't handle incomplete exe file.").getEmptyParse(getConf());
      }

    } catch (Exception e) { // runtime exception
      if (LOG.isWarnEnabled()) {
        LOG.warn("General exception in EXE parser: " + e.getMessage());
        e.printStackTrace(LogUtil.getWarnStream(LOG));
      }
      return new ParseStatus(ParseStatus.FAILED,
          "Can't be handled as exe document. " + e).getEmptyParse(getConf());
    }

    // I'm not sure what to return here if I only need to download the file

    ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, "", null, null, null);
    parseData.setConf(this.conf);
    return new ParseImpl("", parseData);
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }
}
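
The Content-Length comparison in getParse can be tried in isolation, without a Nutch classpath. The sketch below is a hypothetical stand-alone helper (TruncationCheck is not a Nutch class). Unlike the parser above, it treats a non-numeric header as "length unknown" rather than letting the exception reach the catch block; that choice is my assumption, not Nutch behaviour.

```java
// Hypothetical helper (not a Nutch class): isolates the truncation test from getParse.
final class TruncationCheck {

  /**
   * Returns true when a Content-Length header is present, numeric,
   * and different from the number of bytes actually received.
   */
  static boolean isTruncated(String contentLengthHeader, int bytesReceived) {
    if (contentLengthHeader == null) {
      return false; // no header, nothing to compare against
    }
    try {
      return Long.parseLong(contentLengthHeader.trim()) != bytesReceived;
    } catch (NumberFormatException e) {
      return false; // assumption: a malformed header means the length is unknown
    }
  }

  public static void main(String[] args) {
    byte[] raw = new byte[64];
    System.out.println(isTruncated("100", raw.length));
  }
}
```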





-- 
Eyal Edri
