nutch-dev mailing list archives

From "Andrea Spinelli (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
Date Tue, 04 Dec 2007 11:15:43 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548217 ]

Andrea Spinelli commented on NUTCH-585:
---------------------------------------

I absolutely agree that a more general solution is needed; however, I think that some
current Nutch users might benefit from a quick fix.

If there is no opposition, I could submit a patch (fewer than 20 lines).

On the other hand, does anybody think that blocking selected portions of text could pose
serious architectural or stability risks?

About the more general solution, do you think there is a viable path from here to there?

-- andrea


> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-585
>                 URL: https://issues.apache.org/jira/browse/NUTCH-585
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>         Environment: All operating systems
>            Reporter: Andrea Spinelli
>            Priority: Minor
>
> We are using Nutch to index our own web sites; we would like not to index certain parts
> of our pages, because we know they are not relevant (for instance, there are several links
> to change the background color) and generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML comments,
> like
> <!-- START-IGNORE -->
> ... ignored part ...
> <!-- STOP-IGNORE -->
> We feel this might be useful to someone else, perhaps with the comment strings factored out
> as constants in the configuration files (say parser.html.ignore.start and
> parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet. We look forward to any expression
> of interest - or to an explanation of why what we are doing is plain wrong!
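
The patch itself is not included in this message, so the following is only a minimal
sketch of the idea described above, not the submitted code. It assumes the markers are
matched as literal strings in the raw HTML before parsing; the real plugin change may
work differently (for example, on the parsed DOM). The class and method names are
hypothetical, and dropping everything after an unterminated start marker is an assumed
policy. The marker strings would presumably come from the proposed
parser.html.ignore.start and parser.html.ignore.stop properties; reading them from the
Nutch configuration is not shown.

```java
// Hypothetical helper, not part of Nutch: removes every region enclosed by a
// start marker and a stop marker from an HTML string.
final class IgnoreBlockFilter {

    // Marker strings taken from the issue description; in the proposal they
    // would be configurable rather than hard-coded.
    static final String DEFAULT_START = "<!-- START-IGNORE -->";
    static final String DEFAULT_STOP = "<!-- STOP-IGNORE -->";

    private IgnoreBlockFilter() {
    }

    static String strip(String html, String start, String stop) {
        StringBuilder out = new StringBuilder(html.length());
        int pos = 0;
        while (true) {
            int begin = html.indexOf(start, pos);
            if (begin < 0) {
                // No further start marker: keep the rest of the document.
                out.append(html, pos, html.length());
                break;
            }
            // Keep the text that precedes the start marker.
            out.append(html, pos, begin);
            int end = html.indexOf(stop, begin + start.length());
            if (end < 0) {
                // Assumed policy: an unterminated block hides the rest of the page.
                break;
            }
            // Resume copying right after the stop marker.
            pos = end + stop.length();
        }
        return out.toString();
    }
}
```

Calling IgnoreBlockFilter.strip(html, IgnoreBlockFilter.DEFAULT_START,
IgnoreBlockFilter.DEFAULT_STOP) on the page text before it reaches the parser would
drop the marked regions, markers included, and leave everything else unchanged.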

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

