nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
Date Fri, 26 Jun 2015 08:21:04 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602553#comment-14602553 ]

Sebastian Nagel commented on NUTCH-2038:
----------------------------------------

Great [~asitangm]! I tried to run it via parsechecker and also within a small crawl.
* there is still one {{e.printStackTrace();}} :)
* if the plugin is activated in plugin.includes but not configured:
{noformat}
2015-06-26 09:33:24,174 ERROR naivebayes.NaiveBayesParseFilter - ParseFilter: NaiveBayes: trainfile or wordlist not set in the parsefilte.naivebayes.trainfile or parsefilte.naivebayes.wordlist
2015-06-26 09:33:24,175 WARN  parse.ParseSegment - Error parsing: file:/home/wastl/work/websearch/crawler/nutch/src/plugin/parse-exorbyte/sample/subdocuments1-html5.html:
java.lang.IllegalArgumentException: ParseFilter: NaiveBayes: trainfile or wordlist not set in the parsefilte.naivebayes.trainfile or parsefilte.naivebayes.wordlist
        at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:120)
{noformat}
A plugin advertised in the description of plugin.includes should ideally work out of the box.
You could add train/word file templates to conf/ containing a few trivial ham/spam examples.
They would then be copied into runtime/ by the build, and users could simply modify them.
* there should be a clear error message if a configured file fails to load (e.g., "Failed
to load naivebayes-train.txt configured in parsefilter.naivebayes.trainfile: ...") instead
of
{noformat}
Exception in thread "main" java.lang.NullPointerException
        at java.io.Reader.<init>(Reader.java:78)
        at java.io.BufferedReader.<init>(BufferedReader.java:94)
        at java.io.BufferedReader.<init>(BufferedReader.java:109)
        at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:129)
{noformat}
* finally, the JobRunner crashed with:
{noformat}
2015-06-26 09:48:50,762 INFO  naivebayes.NaiveBayesParseFilter - Training the Naive Bayes Model
2015-06-26 09:48:50,764 WARN  mapred.LocalJobRunner - job_local1978281032_0001
java.lang.Exception: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/Analyzer
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/Analyzer
        at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.train(NaiveBayesParseFilter.java:94)
        at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:142)
{noformat}
 That's probably because the dependencies are not listed in the plugin.xml.
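
To illustrate the point about a clear error message when a configured file fails to load, here is a minimal sketch. It is not the plugin's code: the class {{ConfiguredFileLoader}} and its {{open}} method are hypothetical names, and the classpath lookup is only an assumed stand-in for however the plugin actually resolves its files. The idea is to check for a missing resource before wrapping it in a reader, so the user sees the file and property names instead of a NullPointerException.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch or of the submitted patch.
final class ConfiguredFileLoader {

    private ConfiguredFileLoader() {
    }

    /**
     * Opens the file named by a configuration property.
     * Assumption: the file is looked up on the classpath.
     */
    static BufferedReader open(String property, String fileName) throws IOException {
        if (fileName == null || fileName.trim().isEmpty()) {
            throw new IllegalArgumentException("Property " + property + " is not set");
        }
        InputStream in = ConfiguredFileLoader.class.getClassLoader()
                .getResourceAsStream(fileName);
        if (in == null) {
            // Fail here with a readable message rather than letting a null
            // stream reach the Reader constructor.
            throw new IOException("Failed to load " + fileName
                    + " configured in " + property + ": resource not found");
        }
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}
```

Under these assumptions, calling {{ConfiguredFileLoader.open("parsefilter.naivebayes.trainfile", "naivebayes-train.txt")}} with a missing file would report both names in the exception message.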

> Naive Bayes classifier based html Parse filter (for filtering outlinks)
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> An HTML parse filter that will filter out the outlinks in two stages.
> Classify the parse text and decide if the parent page is relevant. If relevant then don't
> filter the outlinks. If irrelevant then go through each outlink and see if the URL contains any
> of the important words from a list. If it does then let it pass.
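
The two-stage decision in the issue description can be sketched roughly as follows. This is an illustration only, not the code attached to the issue: {{TwoStageOutlinkFilter}} is a hypothetical name, the classifier is passed in as a plain predicate rather than an actual Naive Bayes model, and the case-insensitive substring match is an assumption about how "contains" is meant.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

// Hypothetical sketch of the two-stage outlink filtering described above.
final class TwoStageOutlinkFilter {

    private TwoStageOutlinkFilter() {
    }

    static List<String> filter(String parseText,
                               List<String> outlinks,
                               Predicate<String> isRelevant,
                               List<String> wordlist) {
        // Stage 1: a relevant parent page keeps all of its outlinks.
        if (isRelevant.test(parseText)) {
            return new ArrayList<>(outlinks);
        }
        // Stage 2: for an irrelevant parent, keep only outlinks whose URL
        // contains at least one word from the list.
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            String lowerUrl = url.toLowerCase(Locale.ROOT);
            for (String word : wordlist) {
                if (word.isEmpty()) {
                    continue; // an empty word would match every URL
                }
                // Assumption: matching is case-insensitive.
                if (lowerUrl.contains(word.toLowerCase(Locale.ROOT))) {
                    kept.add(url);
                    break;
                }
            }
        }
        return kept;
    }
}
```

In the real plugin the predicate would be backed by the model trained from the configured train file, and the word list would come from the configured wordlist file.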



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
