nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
Date Mon, 22 Jun 2015 07:54:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595470#comment-14595470 ]

Sebastian Nagel commented on NUTCH-2038:
----------------------------------------

You're right that the scoring filter interface is complex and not easy to understand. But
scoring filters are powerful and can do a lot of "magic" beyond pure scoring, e.g., limiting
the crawl by link depth or focused crawling. The ScoringFilter interface is complex because
it must fit into the Nutch workflow. In 2.x the interface is simpler because the workflow
and the underlying data structures are simpler (one web table vs. segments with multiple subdirectories).
Plugins should be lightweight in their resource usage, and it is certainly not ideal if they
run MapReduce jobs (findDatumForUrl must do this in 1.x) or fetch content again via Tika.

> Naive Bayes classifier based url filter
> ---------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A URL filter that filters out the URLs (after the parsing stage) from those pages that
> are classified as irrelevant by the classifier. From such pages it will keep only those
> URLs that contain some "hot words", which are again provided in a list.
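
The hot-word rule in the description above could be sketched roughly as follows. This is only an illustration under assumptions, not the code attached to NUTCH-2038: the class name HotWordUrlFilter and the parentRelevant flag are hypothetical, and the Naive Bayes classification of the parent page is assumed to have happened elsewhere. The sketch only borrows the convention of Nutch's URLFilter.filter, which returns the URL to keep it and null to drop it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of the hot-word rule; not the NUTCH-2038 patch. */
final class HotWordUrlFilter {

    // Lower-cased hot words; blank entries are skipped.
    private final List<String> hotWords = new ArrayList<>();

    HotWordUrlFilter(List<String> words) {
        for (String word : words) {
            String trimmed = word.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                hotWords.add(trimmed);
            }
        }
    }

    /**
     * Returns the URL if it should be kept, or null if it should be dropped.
     *
     * @param url            outlink URL found on the parsed page
     * @param parentRelevant assumed result of the page classifier (computed elsewhere)
     */
    String filter(String url, boolean parentRelevant) {
        if (parentRelevant) {
            return url; // outlinks of relevant pages are kept unchanged
        }
        String lower = url.toLowerCase(Locale.ROOT);
        for (String word : hotWords) {
            if (lower.contains(word)) {
                return url; // irrelevant page, but the URL contains a hot word
            }
        }
        return null; // irrelevant page and no hot word in the URL
    }
}
```

For example, with the hot words "nutch" and "crawl", an outlink such as http://example.org/about found on a page classified as irrelevant would be dropped, while http://example.org/nutch/docs from the same page would be kept.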



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
