nutch-dev mailing list archives

From "Gabriele Kahlout (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support
Date Thu, 02 Jun 2011 09:58:47 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriele Kahlout updated NUTCH-961:
-----------------------------------

    Attachment: NUTCH-961v2.patch

Cleaned up the patch.
To reproduce:
{code}
# Assumes MR_HOME points at the directory holding the downloaded attachments.
export NUTCH_HOME="$(pwd)/nutch"
svn co -r 1101540 http://svn.apache.org/repos/asf/nutch/branches/branch-1.3 "$NUTCH_HOME"
cp "$MR_HOME/BoilerpipeExtractorRepository.java" "$NUTCH_HOME/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/BoilerpipeExtractorRepository.java"
cd "$NUTCH_HOME" && patch -p0 -ui "$MR_HOME/NUTCH-961v2.patch"
ant
{code}

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
