nutch-dev mailing list archives

From "Tien Nguyen Manh (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (NUTCH-961) Expose Tika's boilerpipe support
Date Tue, 26 Jan 2016 06:58:39 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116772#comment-15116772 ]

Tien Nguyen Manh edited comment on NUTCH-961 at 1/26/16 6:57 AM:
-----------------------------------------------------------------

Ah yes, could you explain why we need to parse it twice? With NUTCH-1233, can we use just one
parse?


was (Author: tiennm):
Ah yes, could you explain why we need to parse it twice?

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch,
NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch,
NUTCH-961-1.5-1.patch, NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch,
NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to detect and remove boilerplate
content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
