nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-961) Expose Tika's boilerpipe support
Date Thu, 27 Jan 2011 14:16:44 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987575#action_12987575 ]

Markus Jelsma commented on NUTCH-961:
-------------------------------------

Boilerpipe comes with several algorithms for stripping away boilerplate content. Although
the ArticleExtractor is recommended, it fails for many types of pages. Pages such as news
overviews with blocks and lists are extracted much better by the CanolaExtractor. This
poses a problem: a single configuration directive telling the parser which extractor to
use for a whole crawl is not enough.

Some thoughts on how to deal with it:
- use Boilerpipe's estimator to automatically determine which extractor to use
- have a facility to override false positives returned by the estimator and hardcode which
extractor to use for URL groups (not unlike the subcollection plugin)
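
The two ideas above could be combined roughly as follows. This is only a minimal
sketch under assumptions: the class ExtractorSelector, its methods, and the example
URL pattern are hypothetical and exist in neither Nutch, Tika nor Boilerpipe. The
estimator's choice is passed in as a plain extractor name, so nothing here depends
on the actual estimator API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch: choose an extractor name per URL.
 * Hardcoded URL-group rules win; otherwise the estimator's choice is used.
 */
final class ExtractorSelector {

    /** One override rule: URLs matching the pattern use the given extractor. */
    private static final class Rule {
        final Pattern urlPattern;
        final String extractorName;

        Rule(Pattern urlPattern, String extractorName) {
            this.urlPattern = urlPattern;
            this.extractorName = extractorName;
        }
    }

    // Rules are checked in insertion order; the first match wins.
    private final List<Rule> rules = new ArrayList<>();

    /** Registers an override for a group of URLs described by a regex. */
    ExtractorSelector override(String urlRegex, String extractorName) {
        rules.add(new Rule(Pattern.compile(urlRegex), extractorName));
        return this;
    }

    /**
     * Returns the extractor name to use for the URL.
     *
     * @param url       the page URL
     * @param estimated the extractor name suggested by an estimator
     *                  (assumed to be computed elsewhere)
     */
    String select(String url, String estimated) {
        for (Rule rule : rules) {
            if (rule.urlPattern.matcher(url).find()) {
                return rule.extractorName;
            }
        }
        return estimated;
    }
}
```

With a rule such as override("^https?://news\\.example\\.org/overview/", "CanolaExtractor"),
overview pages would always get the CanolaExtractor, while all other URLs keep whatever
the estimator suggested.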


> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to remove boilerplate
> content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

