nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
Date Thu, 18 Jun 2015 15:22:02 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14591958#comment-14591958
] 

ASF GitHub Bot commented on NUTCH-2038:
---------------------------------------

Github user lewismc commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/32#discussion_r32741673
  
    --- Diff: ivy/ivy.xml ---
    @@ -78,7 +78,11 @@
                     <dependency org="org.apache.cxf" name="cxf-rt-transports-http-jetty" rev="3.0.4"/>
                     <dependency org="com.fasterxml.jackson.core" name="jackson-databind" rev="2.5.1" />
                     <dependency org="com.fasterxml.jackson.dataformat" name="jackson-dataformat-cbor" rev="2.5.1" />
    -                <dependency org="com.fasterxml.jackson.jaxrs" name="jackson-jaxrs-json-provider" rev="2.5.1" />	
    +                <dependency org="com.fasterxml.jackson.jaxrs" name="jackson-jaxrs-json-provider" rev="2.5.1" />
    +                <dependency org="org.apache.mahout" name="mahout-math" rev="0.8" />
    --- End diff --
    
    Hi Asitang,
    I get your point; however, I am also trying to help you get your patch into
    the codebase. Nutch is a crawler... Adding machine learning and indexing
    components such as Mahout (what if someone does not wish to use Mahout?) and
    Lucene (what if someone wishes to use ES?) back into the core codebase
    dependency tree is, on this occasion, not the right way to go.
    If you can please send a new pull request, we can take a look. Excellent work,
    thank you :)
    
    On Thursday, June 18, 2015, asitang <notifications@github.com> wrote:
    
    > In ivy/ivy.xml
    > <https://github.com/apache/nutch/pull/32#discussion_r32741196>:
    >
    > > @@ -78,7 +78,11 @@
    > >                  <dependency org="org.apache.cxf" name="cxf-rt-transports-http-jetty" rev="3.0.4"/>
    > >                  <dependency org="com.fasterxml.jackson.core" name="jackson-databind" rev="2.5.1" />
    > >                  <dependency org="com.fasterxml.jackson.dataformat" name="jackson-dataformat-cbor" rev="2.5.1" />
    > > -                <dependency org="com.fasterxml.jackson.jaxrs" name="jackson-jaxrs-json-provider" rev="2.5.1" />	
    > > +                <dependency org="com.fasterxml.jackson.jaxrs" name="jackson-jaxrs-json-provider" rev="2.5.1" />
    > > +                <dependency org="org.apache.mahout" name="mahout-math" rev="0.8" />
    >
    > I was trying to pave the way for a machine learning library in Nutch, so
    > that anyone can use it in the future.
    
    
    -- 
    *Lewis*



> Naive Bayes classifier based url filter
> ---------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A URL filter that, after the parsing stage, filters out the URLs from pages that the
> classifier deems irrelevant, keeping only those URLs that contain some "hot words"
> provided in a list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
