nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2034) CrawlDB filtered documents counter.
Date Wed, 03 Jun 2015 21:41:40 GMT


Sebastian Nagel commented on NUTCH-2034:

Thanks, good idea! But strictly speaking, we don't know which URL filter rejected a URL.
Should be: "Total number of URLs filtered by URL filters: ..."
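
A minimal sketch of the point above, not Nutch code: if the filter chain only reports "accepted" or "rejected" (modelled here as a null result), a single counter can tell how many URLs were dropped but not which filter dropped them. The class FilteredUrlCounterSketch, the methods applyFilters and countRejected, and the two example filters are hypothetical names chosen for illustration. A real job would presumably increment a Hadoop counter rather than a local variable.

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class FilteredUrlCounterSketch {

    /**
     * Passes the URL through every filter in order.
     * Assumption: a filter returns null to reject a URL, and the chain
     * stops at the first rejection, so the caller only sees null.
     */
    static String applyFilters(List<UnaryOperator<String>> filters, String url) {
        for (UnaryOperator<String> filter : filters) {
            if (url == null) {
                break; // already rejected; which filter did it is not recorded
            }
            url = filter.apply(url);
        }
        return url;
    }

    /** Counts the URLs that were rejected by any filter in the chain. */
    static long countRejected(List<UnaryOperator<String>> filters, List<String> urls) {
        long rejected = 0;
        for (String url : urls) {
            if (applyFilters(filters, url) == null) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // Two hypothetical filters: a scheme check and a suffix check.
        List<UnaryOperator<String>> filters = List.of(
            u -> u.startsWith("http") ? u : null,
            u -> u.endsWith(".jpg") ? null : u);
        List<String> urls = List.of(
            "http://example.org/",
            "ftp://example.org/file",
            "http://example.org/a.jpg");
        System.out.println("Total number of URLs filtered by URL filters: "
            + countRejected(filters, urls));
    }
}
```

Because the count is taken after the whole chain has run, the message can only honestly say "filtered by URL filters", which is why the wording should not mention regex filters specifically.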

> CrawlDB filtered documents counter.
> -----------------------------------
>                 Key: NUTCH-2034
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>            Priority: Minor
>              Labels: counters, crawldb, filter, info, regex
>             Fix For: 1.11
> When we are doing big crawls we would like to know how many of the URLs are being discarded by the regex filters. Currently this is only reported by the Injector class:
> Injector: Total number of urls rejected by filters: 0
> It would be nice to have a counter in the CrawlDB class so we know in every round how many were discarded by our filters:
> CrawlDb update: Total number of URLs filtered by regex filters: 31415

