nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
Date Mon, 29 Jun 2015 14:30:06 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605681#comment-14605681 ]

ASF GitHub Bot commented on NUTCH-2038:
---------------------------------------

GitHub user asitang opened a pull request:

    https://github.com/apache/nutch/pull/40

    NUTCH-2038

    added all the jars in plugin.xml

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/40.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #40
    
----
commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-17T16:11:42Z

    patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-17T16:14:37Z

    patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-17T16:35:28Z

    patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-18T15:09:30Z

    final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-19T20:13:34Z

    Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-24T15:45:50Z

    commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-24T15:46:46Z

    commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-24T15:55:22Z

    commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-24T15:58:12Z

    commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-24T17:31:09Z

    commit for 3.3 patch of NUTCH-2038

commit 5ba14790c1367deeb54d4d61f87be3d602cecedf
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-25T22:59:45Z

    patch 4.0 for NUTCH-2038

commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-25T23:00:40Z

    patch 4.0 for NUTCH-2038

commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-25T23:05:20Z

    patch 4.1 for NUTCH-2038

commit 830f05bfe77abf79b2877c2a9c388fa24b3df526
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-25T23:07:44Z

    patch 4.1 for NUTCH-2038

commit 5e907b1109c8e623bfcdb25b4b467dd53fbec9f3
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-28T23:51:58Z

    Patch 5.0 for NUTCH-2038

commit b984cdfac2d30ef38b1aebbc0330ba7eee1e12bf
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-28T23:53:22Z

    Patch 5.0 for NUTCH-2038

commit ecbd4c27ae71b8c04e011c6b7106cc1fb324e04a
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-28T23:53:52Z

    Patch 5.0 for NUTCH-2038

commit aba64fc941ed7616153d19410dbe9b9a0f8ef387
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-29T00:03:43Z

    Patch 5.0 for NUTCH-2038

commit 71be15df81222adc6b58b6308e1dac7db23b6386
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-29T04:21:38Z

    Patch 5.1 for NUTCH-2038

commit a9465c06d59e7ed2bd13d07c128bcea574fc9d6c
Author: Asitang Mishra <asitang@gmail.com>
Date:   2015-06-29T14:27:02Z

    Patch 5.2 for NUTCH-2038

----


> Naive Bayes classifier based html Parse filter (for filtering outlinks)
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> An HTML parse filter that filters outlinks in two stages.
> First, classify the parse text to decide whether the parent page is relevant. If it is
> relevant, do not filter its outlinks. If it is irrelevant, go through each outlink and check
> whether the URL contains any of the important words from a list. If it does, let it pass.
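
The two-stage decision described in the issue might be sketched roughly as follows. This is only an illustration, not the code from the pull request: the class name TwoStageOutlinkFilter, the method filter, and its parameters are hypothetical, and the Naive Bayes classification of the parse text is assumed to have already produced the boolean parentRelevant. Matching is assumed to be a case-insensitive substring check against lowercase words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the two-stage outlink filter; not the plugin's actual code. */
final class TwoStageOutlinkFilter {

    /**
     * @param parentRelevant result of classifying the parent page's parse text
     *                       (the classifier itself is assumed to run elsewhere)
     * @param outlinks       outlink URLs extracted from the parent page
     * @param importantWords lowercase words that allow an outlink to pass
     * @return the outlinks that survive filtering, in their original order
     */
    static List<String> filter(boolean parentRelevant,
                               List<String> outlinks,
                               Set<String> importantWords) {
        // Stage 1: a relevant parent page keeps all of its outlinks.
        if (parentRelevant) {
            return new ArrayList<>(outlinks);
        }
        // Stage 2: for an irrelevant parent, keep only outlinks whose URL
        // contains at least one of the important words.
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            String lower = url.toLowerCase(Locale.ROOT);
            for (String word : importantWords) {
                if (lower.contains(word)) {
                    kept.add(url);
                    break; // one matching word is enough
                }
            }
        }
        return kept;
    }
}
```

With an irrelevant parent page and the single important word "sports", an outlink such as http://example.com/sports/news would pass while http://example.com/about would be dropped; with a relevant parent page, both would be kept.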



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
