nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: howto skip hiddens ulrs inside div tag?
Date Tue, 06 Sep 2005 10:25:31 GMT
Massimo Miccoli wrote:
> Hi nutch dev,
> 
> After fetching about 100 million pages I see many search engine spammers
> that use a hidden div tag (negative position) to include many URLs
> that users don't see when they access the site page. These links alter the
> boost (by inlink count), so I want to skip these URLs.
> How can I do that?

Implement an HtmlParseFilter, similar to the creativecommons plugin. Your 
plugin would then remove the matching tags.

In fact, if you have some spare cycles, you could implement a more 
generic "html cleanup" plugin, where you could specify a list of XPaths 
to match (and optionally replace).
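
A rough sketch of the XPath-driven cleanup idea, using only the JDK's DOM and 
XPath classes. It deliberately does not implement Nutch's HtmlParseFilter 
interface, since the exact method signature depends on the Nutch version; the 
class name HtmlCleanup, the method removeMatching and the HIDDEN_DIV expression 
are all hypothetical names chosen for illustration. The expression assumes 
lower-case element names and an inline style attribute with a negative left or 
top offset; an HTML parser that upper-cases tag names, or spam that hides the 
div through an external stylesheet, would need different rules.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: removes DOM nodes matched by XPaths.
class HtmlCleanup {

    // Assumed rule: a div whose inline style has a negative left/top offset.
    // translate(@style, ' ', '') strips spaces so "left: -9px" also matches.
    static final String HIDDEN_DIV =
        "//div[contains(translate(@style, ' ', ''), 'left:-')"
        + " or contains(translate(@style, ' ', ''), 'top:-')]";

    /** Detaches every node matched by any expression; returns how many were detached. */
    static int removeMatching(Node root, List<String> xpaths) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        int removed = 0;
        for (String expr : xpaths) {
            NodeList hits = (NodeList) xpath.evaluate(expr, root, XPathConstants.NODESET);
            // Copy the hits first so removal does not disturb the iteration.
            List<Node> snapshot = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                snapshot.add(hits.item(i));
            }
            for (Node n : snapshot) {
                Node parent = n.getParentNode();
                if (parent != null) {
                    parent.removeChild(n);
                    removed++;
                }
            }
        }
        return removed;
    }

    /** Parses well-formed (X)HTML text into a DOM document, for the demo only. */
    static Document parse(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<html><body>"
            + "<a href='http://example.com/visible'>visible</a>"
            + "<div style='position:absolute; left: -9999px'>"
            + "<a href='http://spam.example/1'>x</a>"
            + "<a href='http://spam.example/2'>y</a>"
            + "</div></body></html>");
        int removed = removeMatching(doc, Arrays.asList(HIDDEN_DIV));
        int links = doc.getElementsByTagName("a").getLength();
        System.out.println("removed " + removed + " block(s), " + links + " link(s) left");
    }
}
```

In a real plugin the list of expressions would presumably come from 
configuration, and removeMatching would run on the parsed DOM before outlinks 
are extracted, so the hidden links never reach the inlink count.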

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

