nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2297) CrawlDbReader -stats wrong values for earliest fetch time and shortest interval
Date Mon, 08 Aug 2016 12:11:20 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15411716#comment-15411716 ]

Sebastian Nagel commented on NUTCH-2297:
----------------------------------------

The wrong values are already present in the temporary output of the stats job. To inspect that output:
# comment out {{fileSystem.delete(tmpFolder, true);}} in CrawlDbReader.processStatJobHelper(...)
# dump data in {{crawldb/stat_tmpXXXX}} via {{hadoop fs -text ...}}
While there is only one value for each of the minima (scn, fin, ftn), there are multiple values for the totals and maxima:
{noformat}
retry 1 148125397
retry 2 82761892
retry 3 41645830
scn     0
scx     7369
sct     14807601
scx     7110
sct     20791107
scx     8390
sct     13135199
... (scx and sct repeating)
scx     7010
sct     17505486
fin     15120000
fix     1360800
fit     1336710211200
fix     1360800
fit     1180199008800
...
fix     1360800
fit     1319982048000
ftn     597986250
ftx     26821441
ftt     35611037001815
ftx     26821441
...
{noformat}
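The one-versus-many pattern can be checked by counting how often each key occurs in the dumped text. The following is only a minimal Python sketch: it assumes the dump consists of whitespace-separated key/value lines like those above, the sample is an excerpt of the values shown (elided lines are left out), and the helper name {{count_keys}} is hypothetical.

```python
from collections import Counter

# Excerpt of the dump shown above; the elided ("...") lines are not reproduced.
SAMPLE = """\
retry 1 148125397
scn     0
scx     7369
sct     14807601
scx     7110
sct     20791107
fin     15120000
fix     1360800
fit     1336710211200
fix     1360800
fit     1180199008800
ftn     597986250
ftx     26821441
ftt     35611037001815
ftx     26821441
"""


def count_keys(text):
    """Count the lines per key; the key is everything before the last field."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        key = " ".join(fields[:-1])  # keeps two-word keys such as "retry 1" intact
        counts[key] += 1
    return counts


counts = count_keys(SAMPLE)
for key, n in sorted(counts.items()):
    print(f"{key}\t{n}")
```

In this excerpt the minima keys (scn, fin, ftn) each occur once, while the maxima and totals occur more than once.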
The values for "fin" and "ftn" are already wrong at this point:
{noformat}
# 15120000 sec. = 175 days
% echo $((15120000/(60*60*24)))
175
# 597986250 as "epoch minutes":
% date -u --date=@$((597986250*60))
Thu Dec 20 05:30:00 UTC 3106
{noformat}
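The same arithmetic can be reproduced without GNU {{date}}. A small Python sketch, assuming as above that "fin" is in seconds and "ftn" is in minutes since the Unix epoch (the constant names are illustrative only):

```python
from datetime import datetime, timedelta, timezone

FIN_SECONDS = 15120000   # "fin" from the dump, read as seconds
FTN_MINUTES = 597986250  # "ftn" from the dump, read as minutes since the epoch

# Whole days in the shortest fetch interval.
interval_days = FIN_SECONDS // (60 * 60 * 24)

# Earliest fetch time as a UTC timestamp.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
earliest = epoch + timedelta(minutes=FTN_MINUTES)

print(interval_days)         # 175
print(earliest.isoformat())  # 3106-12-20T05:30:00+00:00
```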
Need to trace what's going wrong in the CrawlDbStatMapper / CrawlDbStatCombiner / CrawlDbStatReducer.
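
For comparison while tracing, here is a reference reduction written as a Python sketch. It is not the Nutch code and makes no claim about where the bug is; it only assumes the key naming visible in the dump (keys ending in "n" are minima, "x" maxima, everything else a sum). The function name {{merge_partials}} and the two "fin" partial values are hypothetical.

```python
MIN_KEYS = {"scn", "fin", "ftn"}
MAX_KEYS = {"scx", "fix", "ftx"}


def merge_partials(pairs):
    """Merge (key, value) pairs: min for minima keys, max for maxima keys, sum otherwise."""
    merged = {}
    for key, value in pairs:
        if key not in merged:
            merged[key] = value  # the first value seeds the result, so no sentinel is needed
        elif key in MIN_KEYS:
            merged[key] = min(merged[key], value)
        elif key in MAX_KEYS:
            merged[key] = max(merged[key], value)
        else:
            merged[key] += value
    return merged


# scx/sct values are taken from the dump; the "fin" partials are made up for illustration.
part_a = [("scx", 7369), ("sct", 14807601), ("fin", 2592000)]
part_b = [("scx", 7110), ("sct", 20791107), ("fin", 15120000)]
part_c = [("scx", 8390), ("sct", 13135199)]

# One-stage merge (reducer only).
direct = merge_partials(part_a + part_b + part_c)

# Two-stage merge (combiner per partition, then reducer); both must agree.
staged = merge_partials(
    list(merge_partials(part_a + part_b).items()) + list(merge_partials(part_c).items())
)

print(direct)
print(staged == direct)
```

Because min, max and sum are associative and commutative, applying the same merge in a combiner and again in the reducer should leave exactly one value per key, whatever the partitioning.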

> CrawlDbReader -stats wrong values for earliest fetch time and shortest interval
> -------------------------------------------------------------------------------
>
>                 Key: NUTCH-2297
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2297
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.13
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.13
>
>
> NUTCH-2286 added min, max and average for fetch interval and fetch time.
> When running in distributed mode (not reproducible in local mode), the values for the minima (earliest fetch time and shortest fetch interval) may be wrong and implausible:
> {noformat}
> TOTAL urls: 7180518032
>  shortest fetch interval:    175 days, 00:00:00             <<<<<< ????
>  avg fetch interval: 10 days, 08:01:36
>  longest fetch interval:     15 days, 18:00:00
>  earliest fetch time:        Thu Dec 20 05:30:00 UTC 3106   <<<<<< ????
>  avg of fetch times: Fri Feb 19 00:07:00 UTC 2016
>  latest fetch time:  Mon Jul 18 05:22:00 UTC 2016
>  retry 0:    6907984913
>  retry 1:    148125397
>  retry 2:    82761892
>  retry 3:    41645830
>  min score:  0.0
>  avg score:  0.014360981
>  max score:  9.25
>  ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
