nutch-dev mailing list archives

From "Luis Lopez (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2034) CrawlDB filtered documents counter.
Date Thu, 04 Jun 2015 16:49:38 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573114#comment-14573114 ]

Luis Lopez commented on NUTCH-2034:
-----------------------------------

Yes, we can use a single general counter and report that, or we could be more specific and
keep one count per filter.

> CrawlDB filtered documents counter.
> -----------------------------------
>
>                 Key: NUTCH-2034
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2034
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>            Priority: Minor
>              Labels: counters, crawldb, filter, info, regex
>             Fix For: 1.11
>
>
> When we are doing big crawls we would like to know how many of the URLs are being discarded
> by the regex filters. Currently this is only reported by the Injector class:
> Injector: Total number of urls rejected by filters: 0
> It would be nice to have a counter in the CrawlDb class so we know, in every round, how
> many URLs were discarded by our filters:
> CrawlDb update: Total number of URLs filtered by regex filters: 31415
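
The two options discussed here (one general counter versus one count per filter) can be sketched in plain Java. This is only an illustration under stated assumptions, not the patch attached to the issue: the class FilteredUrlCounter, its methods, and the filter names are hypothetical, and the filters are simple stand-ins that return null to reject a URL. A real implementation would presumably report through Hadoop job counters rather than an in-memory map.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: counts rejected URLs both in total and per filter.
// Not part of Nutch; all names here are illustrative assumptions.
public class FilteredUrlCounter {
    // Named reject patterns, applied in insertion order.
    private final Map<String, Pattern> filters = new LinkedHashMap<>();
    // Number of URLs rejected by each named filter.
    private final Map<String, Long> rejectedByFilter = new LinkedHashMap<>();
    // General counter: URLs rejected by any filter.
    private long totalRejected = 0;

    // Registers a filter that rejects any URL containing a match for the regex.
    public void addRejectFilter(String name, String regex) {
        filters.put(name, Pattern.compile(regex));
        rejectedByFilter.put(name, 0L);
    }

    // Returns the URL if every filter accepts it, or null if one rejects it.
    // Only the first rejecting filter is charged with the rejection.
    public String filter(String url) {
        for (Map.Entry<String, Pattern> e : filters.entrySet()) {
            if (e.getValue().matcher(url).find()) {
                rejectedByFilter.merge(e.getKey(), 1L, Long::sum);
                totalRejected++;
                return null;
            }
        }
        return url;
    }

    public long getTotalRejected() {
        return totalRejected;
    }

    public Map<String, Long> getRejectedByFilter() {
        return Collections.unmodifiableMap(rejectedByFilter);
    }

    public static void main(String[] args) {
        FilteredUrlCounter counter = new FilteredUrlCounter();
        counter.addRejectFilter("image-suffix", "\\.(gif|jpg|png)$");
        counter.addRejectFilter("query-chars", "[?=]");

        String[] urls = {
            "http://example.com/index.html",
            "http://example.com/logo.png",
            "http://example.com/page?id=1"
        };
        for (String url : urls) {
            counter.filter(url);
        }

        // General counter, in the style of the log line proposed in the issue.
        System.out.println("CrawlDb update: Total number of URLs filtered by regex filters: "
            + counter.getTotalRejected());
        // More specific: one line per filter.
        for (Map.Entry<String, Long> e : counter.getRejectedByFilter().entrySet()) {
            System.out.println("  rejected by " + e.getKey() + ": " + e.getValue());
        }
    }
}
```

Charging each rejection to the first filter that matches keeps the per-filter counts summing to the general counter, so both kinds of report can come from the same pass over the URLs.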



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
