nutch-dev mailing list archives

From "Gabriele Kahlout (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support
Date Thu, 02 Jun 2011 07:21:47 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriele Kahlout updated NUTCH-961:
-----------------------------------

    Attachment: NUTCH-961v2.patch

Tested the patch against a checkout of the 1.3 branch at revision 1101540 and made some trivial
changes to the TikaParser code.
More interestingly, I've also removed the following from parse-plugins.xml:

-        <mimeType name="application/xhtml+xml">
-		<plugin id="parse-html" />
-	</mimeType>
-
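
For context, a minimal sketch of the entry that would presumably handle application/xhtml+xml once the mapping above is gone. This assumes the stock catch-all mapping to parse-tika is present in parse-plugins.xml; that is an assumption about the default configuration, not something shown in the patch excerpt:

```xml
<!-- Sketch only: assumes the default catch-all entry exists in parse-plugins.xml. -->
<!-- With no explicit application/xhtml+xml mapping, such documents would fall     -->
<!-- through to this entry and be parsed by parse-tika instead of parse-html.      -->
<mimeType name="*">
  <plugin id="parse-tika" />
</mimeType>
```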

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to remove boilerplate
> content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
