nutch-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
Date Wed, 24 Jun 2015 01:56:43 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598724#comment-14598724 ]

Chris A. Mattmann commented on NUTCH-2038:
------------------------------------------

Yeah, so here's the deal. I think I can implement a SimilarityUrlFilterPlugin that simply calls
Tika per URL. Tika is extremely fast, and I could compute, e.g., Jaccard similarity on extracted
text features (e.g., n-gramming and/or TF-IDF, or some other summarization
metric) and/or metadata features. This is basically what we did in my 572 class.
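
A minimal sketch of the Jaccard comparison mentioned above, assuming word n-grams as the text features and plain strings standing in for whatever text Tika would extract. The class name JaccardSketch, its methods, and the threshold parameter are hypothetical and are not part of Nutch or Tika; the actual plugin may be designed quite differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration only; not a Nutch or Tika API. */
final class JaccardSketch {

    /** Returns the set of lowercased word n-grams found in the text. */
    static Set<String> wordNGrams(String text, int n) {
        Set<String> grams = new HashSet<>();
        String trimmed = text.toLowerCase(Locale.ROOT).trim();
        if (trimmed.isEmpty() || n < 1) {
            return grams;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i + n <= tokens.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return grams;
    }

    /** Jaccard similarity |A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // assumption: two empty documents are treated as dissimilar
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    /** True when the candidate text is at least `threshold` similar to the reference text. */
    static boolean isSimilar(String candidate, String reference, int n, double threshold) {
        return jaccard(wordNGrams(candidate, n), wordNGrams(reference, n)) >= threshold;
    }
}
```

For example, with bigrams, "a b c" and "b c d" share one of three distinct bigrams, so their similarity is 1/3. A URL filter built on this idea could accept a URL only when isSimilar returns true for its extracted text against some reference text.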

Asitang's idea about doing this with a ParseFilter in parse-tika is neat. I think this issue
should be updated to reflect that, and I'll open a separate one for my Tika-based
SimilarityUrlFilter. As long as it's a plugin and someone is willing to support it as a PMC member
(i.e., me, etc.), there is no reason not to push forward with it. Asitang can move forward with
his ParseFilter, and I (and others) can review what he produces.

> Naive Bayes classifier based url filter
> ---------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A URL filter that filters out the URLs from pages that the classifier marks as irrelevant.
> After the parsing stage, it keeps only those URLs that contain some "hot words", which are
> provided in a list.
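
One possible reading of the description above, as a minimal sketch: outlinks from a page judged relevant are all kept, while outlinks from a page judged irrelevant survive only if the URL contains a hot word. The class HotWordOutlinkSketch, its method, and the boolean standing in for the classifier's verdict are hypothetical and are not the Nutch plugin API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration only; not the actual NUTCH-2038 implementation. */
final class HotWordOutlinkSketch {

    /**
     * Filters a parsed page's outlinks.
     * pageRelevant stands in for the classifier's decision (assumption).
     */
    static List<String> filterOutlinks(boolean pageRelevant,
                                       List<String> outlinks,
                                       List<String> hotWords) {
        if (pageRelevant) {
            return new ArrayList<>(outlinks); // relevant page: keep every outlink
        }
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            String lower = url.toLowerCase(Locale.ROOT);
            for (String word : hotWords) {
                // Case-insensitive substring match against each hot word
                if (lower.contains(word.toLowerCase(Locale.ROOT))) {
                    kept.add(url);
                    break;
                }
            }
        }
        return kept;
    }
}
```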



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
