nutch-dev mailing list archives

From Markus Jelsma <>
Subject RE: Url post filtering
Date Fri, 26 Sep 2014 11:17:17 GMT
Hi - you can use a different regex file at the indexing stage. See nutch-default for the
configuration directive, and use -Dparam=val to override the default regex-urlfilter.txt
file at indexing time.
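
To illustrate what a separate index-time filter file could do, here is a minimal Python sketch of how I understand the regex URL filter to evaluate its rules: lines are read top to bottom, the first pattern found in the URL decides (+ accepts, - rejects), and a URL matching no rule is rejected. The file contents, the example.org host and the function names are all hypothetical, and this is a simulation for checking rules, not Nutch code. I believe the property to look for in nutch-default is urlfilter.regex.file, but please verify that against your version.

```python
import re

# Hypothetical contents of an index-time filter file; the host and path
# are made up for illustration and are not from the original thread.
INDEX_RULES = r"""
# skip the seed page that only exists to provide outlinks
-^https?://example\.org/main\.html$
# accept everything else
+.
"""


def parse_rules(text):
    """Turn filter-file text into a list of (accept, compiled_pattern)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the URL would pass the filter (assumed semantics)."""
    for accept, pattern in rules:
        if pattern.search(url):  # partial match, first matching rule wins
            return accept
    return False  # no rule matched: the URL is dropped


if __name__ == "__main__":
    rules = parse_rules(INDEX_RULES)
    for url in ("http://example.org/main.html",
                "http://example.org/articles/1.html"):
        print(url, "->", "index" if accepts(rules, url) else "skip")
```

With rules like these used only at indexing, the main page can still be fetched and parsed for its outlinks under the normal crawl-time filter, while being left out of the index.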

-----Original message-----
> From:Albinscode <>
> Sent: Friday 26th September 2014 11:25
> To:
> Subject: Url post filtering
> Hello everybody,
> I usually filter URLs before the fetch operation using regex-filter
> to avoid crawling the whole World Wide Web.
> I have a specific need: one main page lists all the URLs to crawl. I
> want to crawl the main page to get its outlinks, but I don't want to
> index this page. How can I proceed?
> I could implement this in my own plugin, but I want to be sure
> nothing like it already exists ;)
> A dirty solution would be to delete the main page URL from the
> generated Solr index with a JSON query, but that is really dirty ;)
> Hope I'm clear.
