nutch-dev mailing list archives

From "Andrea Spinelli (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
Date Thu, 29 Nov 2007 11:13:43 GMT
[PARSE-HTML plugin] Block certain parts of HTML code from being indexed
-----------------------------------------------------------------------

                 Key: NUTCH-585
                 URL: https://issues.apache.org/jira/browse/NUTCH-585
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.9.0
         Environment: All operating systems
            Reporter: Andrea Spinelli
            Priority: Minor


We are using Nutch to index our own web sites. We would like to exclude certain parts of
our pages from the index, because we know they are not relevant (for instance, there are several
links to change the background color) and they generate spurious matches.

We have modified the plugin so that it ignores any HTML between a pair of marker comments, for example:
<!-- START-IGNORE -->
... ignored part ...
<!-- STOP-IGNORE -->

We feel this might be useful to others, perhaps with the comment strings factored out as
configuration properties (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).
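
To make the idea concrete, here is a minimal sketch of the filtering step. It is not the
patch described in this issue: the class and method names are hypothetical, and it works on
the raw HTML string with a regular expression, whereas the parse-html plugin works on a
parsed document, so a real patch would probably look different. The two marker strings are
passed in as parameters, standing in for the proposed parser.html.ignore.start and
parser.html.ignore.stop properties.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Nutch.
public class IgnoreBlockFilter {

    /**
     * Removes every region that starts with <!-- start --> and ends with the
     * next <!-- stop -->, markers included. A start marker with no matching
     * stop marker is left untouched.
     */
    public static String stripIgnored(String html, String start, String stop) {
        Pattern region = Pattern.compile(
                "<!--\\s*" + Pattern.quote(start) + "\\s*-->"
                        + ".*?"                      // reluctant: stop at the first stop marker
                        + "<!--\\s*" + Pattern.quote(stop) + "\\s*-->",
                Pattern.DOTALL);                     // let '.' span line breaks
        return region.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>keep</p>"
                + "<!-- START-IGNORE --><a href=\"#\">red background</a><!-- STOP-IGNORE -->"
                + "<p>also keep</p>";
        // prints: <p>keep</p><p>also keep</p>
        System.out.println(stripIgnored(page, "START-IGNORE", "STOP-IGNORE"));
    }
}
```

Because the match is reluctant, several ignored regions on one page are removed independently
rather than as one large span from the first start marker to the last stop marker.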

We are almost ready to contribute our code snippet. We look forward to any expression of
interest, or to an explanation of why what we are doing is plain wrong!



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

