nutch-dev mailing list archives

From "Marcin Okraszewski (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)
Date Wed, 27 May 2009 21:28:26 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-490:
-------------------------------------

    Attachment: NekoFilters_for_1.0.patch

Patch ported to Nutch 1.0. It includes the two previous patches.

> Extension point with filters for Neko HTML parser (with patch)
> --------------------------------------------------------------
>
>                 Key: NUTCH-490
>                 URL: https://issues.apache.org/jira/browse/NUTCH-490
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: Any
>            Reporter: Marcin Okraszewski
>            Priority: Minor
>         Attachments: HtmlParser.java.diff, NekoFilters_for_1.0.patch, nutch-extensionpoins_plugin.xml.diff
>
>
> In my project I need to set filters for the Neko HTML parser. Instead of hard-coding them, I made an extension point to define filters for Neko. I was following the code for HtmlParser filters. In fact, I think the method that gets the filters could be generalized to handle both cases, but I didn't want to make too big a mess.
> The attached patch is for Nutch 0.9. This part of the code hasn't changed in trunk, so the patch should apply easily.
> BTW, I wonder if it wouldn't be best to have HTML DOM parsing defined by an extension point itself. Currently there are options for Neko and TagSoup. But if someone wanted to use something else, or to give the parser different settings, they would need to modify the HtmlParser class instead of replacing a plugin.
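
The idea described above, filters contributed through an extension point rather than hard-coded in the parser, can be sketched roughly as follows. This is only an illustration and not the code from the attached patch: DocFilter, FilterExtensionPoint and NekoFilterSketch are hypothetical names, and plain strings stand in for the parser's document events so that the sketch runs without Nutch or Neko on the classpath.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a parser filter; not a real Neko or Nutch type. */
interface DocFilter {
    String apply(String html);
}

/** Hypothetical registry playing the role of an extension point. */
class FilterExtensionPoint {
    private final Map<String, DocFilter> extensions = new HashMap<>();

    /** A plugin contributes a filter under an id. */
    void register(String id, DocFilter filter) {
        extensions.put(id, filter);
    }

    /** Returns the filters named in 'order', in that order; unknown ids are skipped. */
    List<DocFilter> resolve(List<String> order) {
        List<DocFilter> chain = new ArrayList<>();
        for (String id : order) {
            DocFilter filter = extensions.get(id);
            if (filter != null) {
                chain.add(filter);
            }
        }
        return chain;
    }
}

class NekoFilterSketch {
    /** The "parser" only sees a list of filters; it does not know which plugins supplied them. */
    static String parse(String html, List<DocFilter> filters) {
        String result = html;
        for (DocFilter filter : filters) {
            result = filter.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        FilterExtensionPoint point = new FilterExtensionPoint();
        // Two illustrative filters, registered as if by separate plugins.
        point.register("strip-script", s -> s.replaceAll("(?s)<script.*?</script>", ""));
        point.register("trim", String::trim);

        List<String> order = new ArrayList<>();
        order.add("strip-script");
        order.add("trim");

        String out = parse("  <p>hi</p><script>x()</script> ", point.resolve(order));
        System.out.println(out);
    }
}
```

With this shape, changing the set of filters means registering or removing an extension, and the parsing code itself stays untouched.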

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

