nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-2034) CrawlDB filtered documents counter.
Date Thu, 11 Feb 2016 12:05:18 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-2034:
----------------------------------------
    Fix Version/s: 1.12

> CrawlDB filtered documents counter.
> -----------------------------------
>
>                 Key: NUTCH-2034
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2034
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>            Priority: Minor
>              Labels: counters, crawldb, filter, info, regex
>             Fix For: 1.12
>
>
> When we run big crawls, we would like to know how many URLs are being discarded
> by the regex filters. Currently this is reported only by the Inject class:
> Injector: Total number of urls rejected by filters: 0
> It would be nice to have a counter in the CrawlDB class as well, so that in every round
> we know how many URLs were discarded by our filters:
> CrawlDb update: Total number of URLs filtered by regex filters: 31415
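
For illustration only, the requested behaviour might be sketched roughly as below. This is a standalone sketch using plain java.util.regex, not Nutch or Hadoop code: the class name FilteredUrlCounter, its methods, and the example reject patterns are all hypothetical. It only shows the idea of counting each URL that a regex filter rejects and printing the total in the format proposed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: counts URLs rejected by regex filters.
public class FilteredUrlCounter {
    private final List<Pattern> rejectPatterns = new ArrayList<>();
    private long filtered = 0;

    // Each regex describes URLs that should be discarded (assumed semantics).
    public FilteredUrlCounter(List<String> rejectRegexes) {
        for (String regex : rejectRegexes) {
            rejectPatterns.add(Pattern.compile(regex));
        }
    }

    // Returns the URL if it passes every filter, or null if it is rejected.
    // A rejection increments the counter exactly once.
    public String filter(String url) {
        for (Pattern p : rejectPatterns) {
            if (p.matcher(url).find()) {
                filtered++;
                return null;
            }
        }
        return url;
    }

    public long getFiltered() {
        return filtered;
    }

    // Summary line in the format suggested in the issue description.
    public String summary() {
        return "CrawlDb update: Total number of URLs filtered by regex filters: " + filtered;
    }
}
```

In a real update job the count would presumably be kept in a job-level counter rather than an instance field, so that it can be aggregated across tasks; that part is left out here.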



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
