nutch-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
Date Fri, 10 Jun 2011 21:57:59 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047490#comment-13047490 ]

Ken Krugler commented on NUTCH-961:
-----------------------------------

Boilerpipe support in Tika works as a delegate: it processes the SAX events generated by
the default content handler, which knows how to help clean up broken HTML.

So it's incremental processing (you don't need to get the full page first).
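
To illustrate the delegate idea, here is a minimal sketch that uses only the JDK's SAX classes.
It is not Tika's or Boilerpipe's code: the class names (DelegateSketch, SkippingHandler,
TextCollector) and the rule "skip text inside nav and footer" are made up for the example, and
the input is well-formed XHTML so that the stock JDK parser can stand in for the HTML cleanup
step.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class DelegateSketch {

    /** Sits between the parser and a delegate, dropping text inside "boilerplate" elements. */
    static final class SkippingHandler extends DefaultHandler {
        // Hypothetical stand-in for real boilerplate detection.
        private static final Set<String> SKIPPED = Set.of("nav", "footer");
        private final ContentHandler delegate;
        private int skipDepth = 0; // > 0 while inside a skipped element

        SkippingHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (skipDepth > 0 || SKIPPED.contains(qName)) {
                skipDepth++;
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (skipDepth > 0) {
                skipDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            // Each event is handled as it arrives; the page is never buffered here.
            if (skipDepth == 0) {
                delegate.characters(ch, start, length);
            }
        }
    }

    /** Delegate that accumulates whatever text it is handed. */
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    static String extract(String xhtml) throws Exception {
        TextCollector collector = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), new SkippingHandler(collector));
        return collector.text.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><nav>Home | About</nav>"
                + "<p>Main text.</p><footer>Copyright</footer></body></html>";
        System.out.println(extract(page)); // prints: Main text.
    }
}
```

The point of the sketch is only the wiring: the filtering handler sees each SAX event as the
parser emits it and decides whether to pass it on to the handler it wraps.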

Separate note: Tika's Boilerpipe support now has an option to return HTML markup, so you could
run it in this mode to get anchors/anchor text.
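
As a rough sketch of what "getting anchors/anchor text" from markup events could look like,
the handler below collects href values and link text from SAX events. It uses only JDK
classes; AnchorSketch and AnchorCollector are hypothetical names, and nothing here is Tika's
or Nutch's actual API. It assumes well-formed XHTML input.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AnchorSketch {

    /** Records "href -> anchor text" for every <a href="..."> element it sees. */
    static final class AnchorCollector extends DefaultHandler {
        final List<String> anchors = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private String href; // non-null only while inside an <a href="...">

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("a".equals(qName) && atts.getValue("href") != null) {
                href = atts.getValue("href");
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (href != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("a".equals(qName) && href != null) {
                anchors.add(href + " -> " + text.toString().trim());
                href = null;
            }
        }
    }

    static List<String> anchors(String xhtml) throws Exception {
        AnchorCollector collector = new AnchorCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.anchors;
    }

    public static void main(String[] args) throws Exception {
        String page = "<div><p>See <a href=\"http://example.org/a\">the docs</a>"
                + " and <a href=\"/b\">more</a>.</p></div>";
        for (String anchor : anchors(page)) {
            System.out.println(anchor);
        }
    }
}
```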


> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-tikaparser.patch,
>                      NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to strip boilerplate
> content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
