nutch-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <>
Subject [jira] [Resolved] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
Date Mon, 29 Jun 2015 05:16:05 GMT


Chris A. Mattmann resolved NUTCH-2038.
    Resolution: Fixed

Alright [~asitang], all committed! I fixed the ParserFactoryTest error. Thanks!

[chipotle:~/tmp/nutch-trunk] mattmann% svn commit -m "fix for NUTCH-2038: Naive Bayes classifier
based html Parse filter (for filtering outlinks) contributed by Asitang Mishra <>
this closes #39"
Sending        .gitignore
Sending        build.xml
Sending        conf/nutch-default.xml
Sending        ivy/ivy.xml
Sending        src/plugin/build.xml
Adding         src/plugin/parsefilter-naivebayes
Adding         src/plugin/parsefilter-naivebayes/build.xml
Adding         src/plugin/parsefilter-naivebayes/ivy.xml
Adding         src/plugin/parsefilter-naivebayes/plugin.xml
Adding         src/plugin/parsefilter-naivebayes/src
Adding         src/plugin/parsefilter-naivebayes/src/java
Adding         src/plugin/parsefilter-naivebayes/src/java/org
Adding         src/plugin/parsefilter-naivebayes/src/java/org/apache
Adding         src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch
Adding         src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter
Adding         src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes
Adding         src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/
Adding         src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/
Adding         src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/
Transmitting file data ...........
Committed revision 1688084.
[chipotle:~/tmp/nutch-trunk] mattmann% 
Great work and thanks to you and Seb and others!

> Naive Bayes classifier based html Parse filter (for filtering outlinks)
> -----------------------------------------------------------------------
>                 Key: NUTCH-2038
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
> An HTML parse filter that filters outlinks in two stages.
> First, classify the parse text and decide whether the parent page is relevant. If it is
> relevant, don't filter the outlinks. If it is irrelevant, go through each outlink and check
> whether the URL contains any of the important words from a list. If it does, let it pass.
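
The two-stage logic described in the issue can be sketched roughly as follows. This is only an illustration, not the code committed in the plugin: the class name `OutlinkFilterSketch`, the method `filterOutlinks`, and the `isRelevant` predicate (a stand-in for the trained Naive Bayes classifier) are all hypothetical, and the sketch assumes that word matching is case-insensitive and that outlinks without a matching word are dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

// Hypothetical sketch of the two-stage outlink filter; not the plugin's actual code.
final class OutlinkFilterSketch {

    /**
     * @param parseText      text extracted from the parent page
     * @param outlinks       outlink URLs found on the parent page
     * @param isRelevant     stand-in for the classifier (assumption: true means relevant)
     * @param importantWords list of words that let an outlink pass in stage two
     * @return the outlinks that survive filtering
     */
    static List<String> filterOutlinks(String parseText,
                                       List<String> outlinks,
                                       Predicate<String> isRelevant,
                                       List<String> importantWords) {
        // Stage 1: a relevant parent page keeps all of its outlinks.
        if (isRelevant.test(parseText)) {
            return new ArrayList<>(outlinks);
        }

        // Stage 2: the parent is irrelevant, so keep only outlinks whose URL
        // contains at least one important word (case-insensitive by assumption).
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            String lowerUrl = url.toLowerCase(Locale.ROOT);
            for (String word : importantWords) {
                if (lowerUrl.contains(word.toLowerCase(Locale.ROOT))) {
                    kept.add(url);
                    break; // one match is enough to let the outlink pass
                }
            }
        }
        return kept;
    }
}
```

Passing the classifier in as a `Predicate<String>` keeps the sketch independent of any particular model. In the real plugin the relevance decision would come from the trained classifier rather than from a caller-supplied function.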
