nutch-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
Date Mon, 22 Jun 2015 05:33:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595388#comment-14595388 ]

Chris A. Mattmann commented on NUTCH-2038:
------------------------------------------

That's what we were working on. My 572 class in Fall 2014 (and in Spring 2015) implemented
different versions of the above, and they worked OK. I figured that I'd contribute it upstream
to Nutch. Asitang was one of the students, so I thought we could do both approaches. Furthermore,
I realize URL filters are supposed to be fast, but they also present an understandable workflow.
We've always had people question scores, and they aren't as intuitive to me as "accept this
URL (or not)"; to me that's the basis of a domain-specific, or focused, crawler. Indirectly
the score interface can also do this, I agree, but to me it's not as explicit as the URLFilter.

> Naive Bayes classifier based url filter
> ---------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A URL filter that filters out URLs found on pages that the classifier marks as irrelevant.
> After the parsing stage, it keeps only those URLs from such pages that contain some
> "hot words", which are provided in a list.
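
To make the accept-or-reject rule in the description concrete, here is a minimal sketch in plain Java. It is not the code attached to NUTCH-2038 and it does not use any Nutch classes: the class name HotWordUrlFilter, the two-argument filter method, and the boolean standing in for the Naive Bayes classifier's verdict on the parent page are all assumptions made for illustration. Returning the URL to accept it and null to reject it is meant to mirror the "accept this URL (or not)" idea from the comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the NUTCH-2038 patch and not a Nutch plugin.
public class HotWordUrlFilter {
    // Lower-cased hot words, so matching is case-insensitive.
    private final List<String> hotWords = new ArrayList<>();

    public HotWordUrlFilter(List<String> words) {
        for (String w : words) {
            hotWords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    /**
     * Returns the URL if it is accepted, or null if it is rejected.
     *
     * @param url                the outlink found after parsing a page
     * @param parentPageRelevant assumed classifier verdict for the page the
     *                           outlink came from (the classifier itself is
     *                           out of scope for this sketch)
     */
    public String filter(String url, boolean parentPageRelevant) {
        // Outlinks of relevant pages pass through untouched.
        if (parentPageRelevant) {
            return url;
        }
        // Outlinks of irrelevant pages survive only if they contain a hot word.
        String lower = url.toLowerCase(Locale.ROOT);
        for (String w : hotWords) {
            if (lower.contains(w)) {
                return url;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        words.add("nutch");
        words.add("crawler");
        HotWordUrlFilter f = new HotWordUrlFilter(words);

        System.out.println(f.filter("http://example.org/about", true));
        System.out.println(f.filter("http://example.org/nutch/docs", false));
        System.out.println(f.filter("http://example.org/ads", false));
    }
}
```

A substring match on the URL keeps the check cheap, which fits the expectation that URL filters are fast; the cost of classifying the page is assumed to be paid once, elsewhere, after parsing.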



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
