nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
Date Fri, 26 Feb 2016 11:04:18 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168821#comment-15168821 ]

ASF GitHub Bot commented on NUTCH-961:
--------------------------------------

GitHub user jeremie70 opened a pull request:

    https://github.com/apache/nutch/pull/92

    Add the boilerpipe parsing adapted from NUTCH-961

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jeremie70/nutch my-branch

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/92.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #92
    
----
commit f185bc4461c57a1a85578de0ecf0884c7026c3a6
Author: Jérémie Bourseau <jeremie.bourseau@xilopix.com>
Date:   2016-02-26T10:37:28Z

    improve parser with boilerpipe

commit 93ea2e51f444447be41ec93b2c0b0b61c117eeb3
Author: Jérémie Bourseau <jeremie.bourseau@xilopix.com>
Date:   2016-02-26T10:37:28Z

    NUTCH-961 improve parser with boilerpipe

commit be91764fdf59d4f6930fc3211a84a252e5452674
Author: Jérémie Bourseau <jeremie.bourseau@xilopix.com>
Date:   2016-02-26T11:00:36Z

    Merge branch 'my-branch' of https://github.com/jeremie70/nutch into my-branch

----


> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch,
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch,
> NUTCH-961-1.5-1.patch, NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch,
> NUTCH-961.patch, NUTCH-961.patch, NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to remove boilerplate
> content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.
> Use the following properties to enable and control Boilerpipe.
> {code}
> <property>
>   <name>tika.extractor</name>
>   <value>none</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
>  
> <property> 
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description> 
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
> {code}
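
As an illustration of the properties quoted above, a site-level override that switches Boilerpipe on might look like the fragment below. This is only a sketch: it assumes the override is placed in conf/nutch-site.xml (the usual location for local Nutch settings, not stated in this issue) and that the property names and values are exactly those given in the description.

```xml
<!-- Sketch only: assumed to live inside <configuration> in conf/nutch-site.xml -->
<property>
  <name>tika.extractor</name>
  <!-- "boilerpipe" enables extraction; the quoted default is "none" -->
  <value>boilerpipe</value>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <!-- One of the values listed in the description:
       DefaultExtractor, ArticleExtractor or CanolaExtractor -->
  <value>ArticleExtractor</value>
</property>
```

If tika.extractor is left at none, the algorithm property presumably has no effect, since no Boilerpipe extraction is requested.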



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
