nutch-dev mailing list archives

From "Asitang Mishra (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
Date Wed, 17 Jun 2015 16:51:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590080#comment-14590080
] 

Asitang Mishra commented on NUTCH-2038:
---------------------------------------

I have made a pull request with a rather rough patch. This initial patch is mainly meant to show
the idea and get some reviews.


IDEA:
Two-tier architecture for filtering:
The filter is called from the parser and looks at the page that was just parsed. It runs a
NB classification on the text of the page and decides whether the page is relevant. If it is
relevant, all of its outlinks pass. If not, a second check kicks in, which looks for
"hotwords" in the outlink URLs themselves (from a wordlist provided by the user). If an outlink
matches, it passes.
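
To make the two tiers concrete, here is a minimal Python sketch of the decision logic only. It is not the code from the patch: the function name `filter_outlinks` is hypothetical, the page-level NB result is passed in as a boolean, and case-insensitive substring matching of hotwords against the URL is an assumption about how the second check could work.

```python
def filter_outlinks(page_is_relevant, outlinks, hotwords):
    """Return the outlinks that survive the two-tier check.

    page_is_relevant: result of the page-level classifier (tier 1).
    outlinks: iterable of URL strings found on the parsed page.
    hotwords: user-provided words checked against each URL (tier 2).
    """
    # Tier 1: a relevant page lets all of its outlinks through.
    if page_is_relevant:
        return list(outlinks)

    # Tier 2 (assumed matching rule): keep an outlink only if its URL
    # contains at least one hotword, ignoring case.
    lowered = [word.lower() for word in hotwords]
    return [url for url in outlinks
            if any(word in url.lower() for word in lowered)]
```

With an empty wordlist, an irrelevant page passes no outlinks at all in this sketch.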



HOW TO USE:
Activate the model filter in the plugin.includes property:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(model|regex)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>
  </description>
</property>

You also need to set some properties in nutch-site.xml, for example:
<property>
  <name>parser.modelfilter.trainfile</name>
  <value>train/tweets-train.tsv</value>
  <description>
  </description>
</property>

<property>
  <name>parser.modelfilter.dictionaryfile</name>
  <value>wordlist.txt</value>
  <description>
  </description>
</property>

<property>
  <name>parser.modelfilter</name>
  <value>true</value>
  <description>
  </description>
</property>



TRAINING FILE:
Keep the training file in a local folder named "train". Keep the wordlist in the conf directory.
The format of the training file is as follows:

1 21312123 I am feeling happy
1 34354646 how are you
0 35345435 can i get some coffee

Each line holds tab-separated (\t) values:
<class/target: either 1 (relevant) or 0 (irrelevant)><TAB><unique ID for
each line, supplied by the user><TAB><TEXT>
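
As a quick way to sanity-check a training file against this format, here is a small Python sketch. The helper name `parse_training_line` is hypothetical and is not part of the patch. It assumes exactly the three tab-separated fields described above and rejects any label other than 0 or 1.

```python
def parse_training_line(line):
    """Split one training line into (label, unique_id, text)."""
    # Split on the first two tabs only, so the text field may itself
    # contain tabs without breaking the parse.
    fields = line.rstrip("\r\n").split("\t", 2)
    if len(fields) != 3:
        raise ValueError("expected <label><TAB><id><TAB><text>: %r" % line)

    label, unique_id, text = fields
    if label not in ("0", "1"):
        raise ValueError("label must be 0 or 1, got %r" % label)
    return int(label), unique_id, text
```

For example, `parse_training_line("1\t21312123\tI am feeling happy")` gives `(1, "21312123", "I am feeling happy")`.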



WORDLIST:

A list of words, one per line, for example:

atmosphere
java
python
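
For illustration, a wordlist in this shape could be read as below. This is a sketch, not the plugin's loader: the name `load_wordlist` is hypothetical, and skipping blank lines and lowercasing the words are assumptions.

```python
def load_wordlist(lines):
    """Turn an iterable of lines (e.g. an open file) into a list of hotwords."""
    words = []
    for line in lines:
        word = line.strip().lower()  # assumption: matching ignores case
        if word:                     # assumption: blank lines are skipped
            words.append(word)
    return words
```

It can be fed an open file directly, e.g. `load_wordlist(open("wordlist.txt"))`.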





> Naive Bayes classifier based url filter
> ---------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A url filter that will filter out the urls (after the parsing stage; it will keep only
> those urls that contain some "hot words" provided again in a list) from the pages that are
> classified irrelevant by the classifier (using a model provided).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
