nutch-dev mailing list archives

From "Alexander Kingson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
Date Wed, 01 Apr 2015 21:59:54 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391558#comment-14391558 ]

Alexander Kingson commented on NUTCH-961:
-----------------------------------------

Hello,

Since I was not getting satisfactory results after upgrading to boilerpipe 1.2.0 with parse-tika
(with boilerpipe support), I have added some code to the nutch-2.x parser to get the same results
as the boilerpipe demo website. I used some code from .v2.patch.
I am attaching the patch.

Thanks.
Alex.

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.11
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch,
NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch,
NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, NUTCH-961v2.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate
content from HTML pages. We should see how we can expose Boilerplate in the Nutch configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
