nutch-dev mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: Url post filtering
Date Fri, 26 Sep 2014 11:17:17 GMT
Hi - you can use a different regex filter file at the indexing stage. See nutch-default.xml for
the configuration directive, and use -Dparam=val to override the default regex-urlfilter.txt
file when indexing.
Markus
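
To make the suggestion above concrete, here is a minimal sketch of what an indexing-only rule file could contain and how its rules would be evaluated. It is a small Python simulation, not Nutch code. The file name regex-urlfilter-index.txt and the example.com URLs are hypothetical placeholders. The property to override is, I believe, urlfilter.regex.file, but please confirm the name in the nutch-default.xml shipped with your version, and check the usage output of the indexing command to see whether URL filtering has to be switched on explicitly at that stage. The sketch assumes the usual regex-urlfilter semantics: each rule starts with + or -, the first matching rule decides, and a URL that matches no rule is rejected.

```python
import re

# Hypothetical indexing-stage rule file (e.g. saved as regex-urlfilter-index.txt).
# The host and path are placeholders, not taken from this thread.
INDEX_RULES = """\
# skip the hub page that only exists to provide outlinks
-^https?://www\\.example\\.com/main\\.html$
# accept everything else
+.
"""

def parse_rules(text):
    """Return a list of (accept, compiled_pattern) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

rules = parse_rules(INDEX_RULES)
for url in ("http://www.example.com/main.html",
            "http://www.example.com/articles/1.html"):
    print(url, "->", "index" if accepts(rules, url) else "skip")
```

With a file like this used only when indexing, the fetch-time regex-urlfilter.txt can keep accepting the hub page, so its outlinks are still followed while the page itself stays out of the index.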

 
 
-----Original message-----
> From:Albinscode <albinscode@gmail.com>
> Sent: Friday 26th September 2014 11:25
> To: dev@nutch.apache.org
> Subject: Url post filtering
> 
> Hello everybody,
> 
> I usually filter URLs before the fetch step with regex-urlfilter,
> to avoid crawling the whole world wide web.
> 
> I've got a specific need: one main page lists all the URLs to crawl. I
> want to crawl the main page to get its outlinks, but I don't want to
> index this page. How can I proceed?
> 
> I could implement this in my own plugin, but, as ever, I want to be
> sure nothing like it already exists ;)
> A dirty solution would be to delete the main page's URL from the generated
> Solr index with a JSON query, but yeah, this is really dirty ;)
> 
> Hope I'm clear.
> 
