nutch-dev mailing list archives

From "Alexander Kingson (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support
Date Wed, 01 Apr 2015 22:00:54 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Kingson updated NUTCH-961:
------------------------------------
    Attachment: nutch-2.x-boilerpipe.patch

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.11
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch,
> NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch,
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, NUTCH-961v2.patch,
> nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to strip boilerplate
> content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
