nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
Date Wed, 24 Jun 2015 22:20:04 GMT


Sebastian Nagel commented on NUTCH-2038:

Hi [~asitang], the latest pull request #36 looks good.
- maybe rename the plugin to parsefilter-naivebayes for simplicity and in advance of NUTCH-1482
- is this statement still true?
bq. CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this classifier.
- afaics, the way the model is generated, stored and loaded needs a review:
-* it should be read/generated once and then cached in memory,
-* writing the model to disk is likely to become painful in distributed mode with concurrent
- cosmetics:
-* exceptions should be logged via LOG.error(StringUtils.stringifyException(e)) so that they do
not get lost somewhere in stdout/stderr, as happens with e.printStackTrace()
-* code formatting, see [[1|]]
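
To illustrate the "read/generated once and then cached in memory" point above, here is a minimal sketch of a lazily initialized, thread-safe holder. The class name ModelCache and the Supplier-based loader are hypothetical and not part of Nutch or of the pull request; the actual plugin may need a different scope (for example per-configuration caching).

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper (not Nutch API): builds the model on first use and
 * keeps it in memory, so later calls neither retrain nor re-read it.
 */
final class ModelCache<M> {
  private final Supplier<M> loader; // assumed: trains or reads the model
  private volatile M model;         // cached instance, null until first get()

  ModelCache(Supplier<M> loader) {
    this.loader = loader;
  }

  /** Returns the cached model, invoking the loader at most once (double-checked locking). */
  M get() {
    M local = model;
    if (local == null) {
      synchronized (this) {
        local = model;
        if (local == null) {
          // Assumes the loader returns a non-null model; a null result
          // would trigger another load on the next call.
          local = loader.get();
          model = local;
        }
      }
    }
    return local;
  }
}
```

With something like this, a filter instance would call get() for every parsed page, while the expensive load runs only on the first call in each JVM.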

> Naive Bayes classifier based html Parse filter (for filtering outlinks)
> -----------------------------------------------------------------------
>                 Key: NUTCH-2038
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
> An HTML parse filter that filters outlinks in two stages.
> First, classify the parse text to decide whether the parent page is relevant. If it is relevant,
> do not filter the outlinks. If it is irrelevant, go through each outlink and check whether the URL
> contains any of the important words from a list. If it does, let the outlink pass.
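
The two-stage logic in the issue description could be sketched roughly as follows. This is only an illustration under stated assumptions: the class TwoStageOutlinkFilter is hypothetical, the classifier is replaced by a plain Predicate on the parse text, outlinks are represented as URL strings, and the case-insensitive word match is an assumption that the description does not specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

/** Hypothetical sketch of the two-stage outlink filtering; not the plugin's code. */
final class TwoStageOutlinkFilter {
  private final Predicate<String> isRelevant; // stand-in for the Naive Bayes classifier
  private final List<String> importantWords;  // word list checked against outlink URLs

  TwoStageOutlinkFilter(Predicate<String> isRelevant, List<String> importantWords) {
    this.isRelevant = isRelevant;
    this.importantWords = importantWords;
  }

  /** Returns the outlinks that survive filtering for a page with the given parse text. */
  List<String> filter(String parseText, List<String> outlinks) {
    // Stage 1: a relevant parent page keeps all of its outlinks.
    if (isRelevant.test(parseText)) {
      return new ArrayList<>(outlinks);
    }
    // Stage 2: for an irrelevant page, keep only URLs containing an important word.
    List<String> kept = new ArrayList<>();
    for (String url : outlinks) {
      String lower = url.toLowerCase(Locale.ROOT); // assumption: case-insensitive match
      for (String word : importantWords) {
        if (lower.contains(word.toLowerCase(Locale.ROOT))) {
          kept.add(url);
          break;
        }
      }
    }
    return kept;
  }
}
```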

This message was sent by Atlassian JIRA
