nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <>
Subject [jira] Commented: (NUTCH-477) Extend URLFilters to support different filtering chains
Date Fri, 24 Apr 2009 11:12:31 GMT


Julien Nioche commented on NUTCH-477:

Having a scope for the URL filters could be useful for focused crawls. If, for instance, we
want to parse only a limited number of domains, we could use different filters in
ParseOutputFormat (keeping some of the outgoing links with the usual prefix and suffix
filters) and in CrawlDBFilter (keeping only the URLs that match our limited set of domains).

Another approach would be to use a different set of filters for generation, so that we fetch
only within the domains of interest but keep all URLs in the crawlDB.

Of course, we could use custom scorers to give a low score to URLs we don't want to fetch and
set a threshold for generation, but IMHO being able to do that with the filters would be
more elegant.
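
To make the idea concrete, here is a minimal sketch of per-scope filter chains. It is not the Nutch API or the attached patch: the class ScopedUrlFilters, the helpers hostIn and notSuffix, and the scope names "outlink" and "crawldb" are all hypothetical. The only convention taken from the existing URLFilter contract is that a filter returns the URL to accept it and null to reject it.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical sketch, not the Nutch API: one filter chain per scope.
class ScopedUrlFilters {
    // Each filter returns the URL to accept it, or null to reject it.
    private final Map<String, List<UnaryOperator<String>>> chains = new HashMap<>();

    // Appends a filter to the chain of the given scope (scope names are made up here).
    void add(String scope, UnaryOperator<String> filter) {
        chains.computeIfAbsent(scope, k -> new ArrayList<>()).add(filter);
    }

    // Runs the chain registered for the scope; a scope with no filters accepts everything.
    String filter(String url, String scope) {
        for (UnaryOperator<String> f : chains.getOrDefault(scope, Collections.emptyList())) {
            url = f.apply(url);
            if (url == null) {
                return null;
            }
        }
        return url;
    }

    // Accepts only URLs whose host is in the given set.
    static UnaryOperator<String> hostIn(Set<String> hosts) {
        return url -> {
            try {
                String host = URI.create(url).getHost();
                return host != null && hosts.contains(host) ? url : null;
            } catch (IllegalArgumentException e) {
                return null; // malformed URL: reject
            }
        };
    }

    // Rejects URLs ending with the given suffix.
    static UnaryOperator<String> notSuffix(String suffix) {
        return url -> url.endsWith(suffix) ? null : url;
    }
}
```

With a suffix filter registered under "outlink" and a host filter under "crawldb", the same URL from an outside domain would survive parsing but be dropped when the crawlDB is updated, which is the focused-crawl behaviour described above.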

> Extend URLFilters to support different filtering chains
> -------------------------------------------------------
>                 Key: NUTCH-477
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.1
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>            Priority: Minor
>             Fix For: 1.1
>         Attachments: urlfilters.patch
> I propose to make the following changes to URLFilters:
> * extend URLFilters so that they support different filtering rules depending on the context
> where they are executed. This functionality mirrors the one that URLNormalizers already support.
> * change their return value to an int code, in order to support early termination of
> long filtering chains.
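
One way the second point could look is sketched below. The issue does not say which codes the patch uses, so the constants ACCEPT, REJECT and PASS, the CodeFilter interface, and the accept-by-default rule at the end of the chain are all assumptions made for illustration.

```java
import java.util.List;

// Hypothetical sketch of int result codes; none of these names come from the patch.
class FilterChainSketch {
    static final int ACCEPT = 1;  // accept now and skip the remaining filters
    static final int REJECT = -1; // reject now and skip the remaining filters
    static final int PASS = 0;    // no decision, continue with the next filter

    interface CodeFilter {
        int filter(String url);
    }

    // Walks the chain and stops at the first filter that makes a decision.
    static boolean accepts(List<CodeFilter> chain, String url) {
        for (CodeFilter f : chain) {
            int code = f.filter(url);
            if (code == ACCEPT) {
                return true;
            }
            if (code == REJECT) {
                return false;
            }
        }
        return true; // assumption: a URL that no filter rejects is accepted
    }
}
```

Compared with a String return value, a distinct ACCEPT code lets a cheap filter early in the chain settle the outcome without running the more expensive filters that follow it.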

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
