nutch-dev mailing list archives

From "Asitang Mishra (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (NUTCH-2038) Naive Bayes classifier based url filter
Date Mon, 22 Jun 2015 17:10:02 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596239#comment-14596239 ]

Asitang Mishra edited comment on NUTCH-2038 at 6/22/15 5:09 PM:
----------------------------------------------------------------

From what I understand, the problem is that a URL filter in Nutch has a very simple interface
(it has no provision for content) and is only "fired" in the generator step.

Problems:
[~chrismattmann]:
1> It cannot be part of the core; it should be a plugin and be called as a general plugin
from the core (right now, in my patch, it is more visible than a general plugin).
2> It should be a URL filter and not a scoring filter, to preserve the simplicity and transparency
of the methodology.
[~wastl-nagel]:
1> The plugin should not read content or call Tika, as it will be a Hadoop job and also
not lightweight.
2> It should be a scoring filter, as the interface already in place supports such an improvement.


I suggest that, if we all agree to let it be a URL filter (and that's completely up to
you guys), I can either enhance the existing URL filter interface or
make an abstract class (which will be very generic and has a filter function that takes some
args and a string).
The parser would then call all the URL filters as well, but without firing the original filter()
function (that stays for the generator); it would fire the new filter function instead. That way
the only change in Nutch will be that the parser also calls the URL filters (and
this will be very generic). We also would not need to read the crawl db or call Tika
for my specific filter.
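
To make the abstract-class option concrete, here is a rough sketch of what I have in mind. It is not code from the patch and it does not use the real Nutch classes: SimpleUrlFilter, ContextAwareUrlFilter, ParentRelevanceFilter and the "parentRelevant" key are all hypothetical names picked for illustration, and passing the args as a Map is only one assumed way to do it.

```java
import java.util.Map;

/** Stand-in for a plain URL filter: return the URL to keep it, null to drop it. */
interface SimpleUrlFilter {
    String filter(String url);
}

/**
 * Hypothetical abstract base class. The one-argument filter() stays as the
 * generator-side call; the two-argument filter() is the new entry point that
 * the parser would fire instead.
 */
abstract class ContextAwareUrlFilter implements SimpleUrlFilter {

    /** Generator-side behaviour: accept every URL unless a subclass overrides this. */
    @Override
    public String filter(String url) {
        return url;
    }

    /** Parser-side entry point; 'args' is an assumed bag of values supplied by the parser. */
    public abstract String filter(String url, Map<String, Object> args);
}

/** Example subclass: drops a link when the parser reports that its parent page was irrelevant. */
class ParentRelevanceFilter extends ContextAwareUrlFilter {

    @Override
    public String filter(String url, Map<String, Object> args) {
        // "parentRelevant" is an assumed key; a missing value is treated as irrelevant.
        boolean relevant = Boolean.TRUE.equals(args.get("parentRelevant"));
        return relevant ? url : null;
    }
}
```

With something like this the generator keeps calling filter(url) exactly as before, and only the parser needs to know about the second method.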



> Naive Bayes classifier based url filter
> ---------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A URL filter that, after the parsing stage, filters out the URLs found on pages that the
> classifier marks as irrelevant, keeping only those URLs that contain some "hot words"
> provided in a list.
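
One way to read the description above, sketched in Java. This is an illustration only, not the code from the patch: HotWordOutlinkFilter and keep() are hypothetical names, the classifier's verdict is passed in as a plain boolean (the Naive Bayes model itself is left out), and matching hot words as case-insensitive substrings of the URL is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper showing the outlink selection rule described in the issue. */
final class HotWordOutlinkFilter {

    private HotWordOutlinkFilter() {
    }

    /**
     * Returns the outlinks to keep for one parsed page.
     * Relevant page: every outlink is kept.
     * Irrelevant page: only outlinks whose URL contains a hot word are kept.
     */
    static List<String> keep(boolean pageRelevant, List<String> outlinks, List<String> hotWords) {
        if (pageRelevant) {
            return new ArrayList<>(outlinks);
        }
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            String lower = url.toLowerCase(Locale.ROOT);
            for (String word : hotWords) {
                if (lower.contains(word.toLowerCase(Locale.ROOT))) {
                    kept.add(url);
                    break; // one matching hot word is enough
                }
            }
        }
        return kept;
    }
}
```

For example, with the hot word "rocket" an irrelevant page would keep a link such as http://example.org/rocket-launch and drop http://example.org/about, while a relevant page would keep both.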



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
