nutch-dev mailing list archives

From "Luis Lopez (JIRA)" <>
Subject [jira] [Created] (NUTCH-2034) CrawlDB filtered documents counter.
Date Wed, 03 Jun 2015 19:21:37 GMT
Luis Lopez created NUTCH-2034:

             Summary: CrawlDB filtered documents counter.
                 Key: NUTCH-2034
             Project: Nutch
          Issue Type: Improvement
          Components: crawldb
    Affects Versions: 1.10
            Reporter: Luis Lopez
            Priority: Minor
             Fix For: 1.11

When we are doing big crawls, we would like to know how many of the URLs are being discarded
by the regex filters. Currently this is only reported by the Injector class:

Injector: Total number of urls rejected by filters: 0

It would be nice to have a counter in the CrawlDB class so we know, in every round, how many
URLs were discarded by our filters:

CrawlDb update: Total number of URLs filtered by regex filters: 31415
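
To illustrate the request, here is a minimal, self-contained sketch of such a counter. It is not the Nutch implementation and does not use Nutch or Hadoop classes: the class name FilteredUrlCounter, its methods, and the example patterns are all hypothetical. The only convention borrowed is that a filter returns null for a rejected URL; the counter is incremented each time that happens, and the total is reported once at the end of the round.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: counts URLs rejected by regex rules.
public class FilteredUrlCounter {
    private final List<Pattern> rejectPatterns = new ArrayList<>();
    private long filtered = 0;

    // Each regex is an exclusion rule: a URL matching any of them is discarded.
    public FilteredUrlCounter(List<String> rejectRegexes) {
        for (String regex : rejectRegexes) {
            rejectPatterns.add(Pattern.compile(regex));
        }
    }

    // Returns the URL unchanged if it is accepted, or null if it is rejected.
    // Every rejection increments the counter.
    public String filter(String url) {
        for (Pattern p : rejectPatterns) {
            if (p.matcher(url).find()) {
                filtered++;
                return null;
            }
        }
        return url;
    }

    public long getFiltered() {
        return filtered;
    }

    // Builds the log line proposed in this issue.
    public String summary() {
        return "CrawlDb update: Total number of URLs filtered by regex filters: " + filtered;
    }

    public static void main(String[] args) {
        // Example exclusion rules (assumed): image suffixes and query-like characters.
        FilteredUrlCounter counter = new FilteredUrlCounter(
            Arrays.asList("\\.(gif|jpg|png)$", "[?*!@=]"));

        List<String> urls = Arrays.asList(
            "http://example.com/index.html",
            "http://example.com/logo.png",
            "http://example.com/search?q=nutch");

        for (String url : urls) {
            counter.filter(url);
        }
        System.out.println(counter.summary());
    }
}
```

In a real MapReduce job the per-instance field would not be enough, since filtering runs across many tasks; the count would have to be aggregated through the job's counter mechanism and read back once the update job finishes.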

