nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-864) Fetcher generates entries with status 0
Date Fri, 01 Oct 2010 15:24:33 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916912#action_12916912 ]

Andrzej Bialecki  commented on NUTCH-864:
-----------------------------------------

I think the difficulty comes from a simplification in 2.x compared to 1.x: we now keep a
single status per page. In 1.x, a side effect of keeping two statuses in two locations
(a "db status" in crawldb and a "fetch status" in segments) was that updatedb had more
information to act upon.

Now we should probably keep up to two statuses: one that reflects a temporary fetch status,
as determined by the fetcher, and a final (reconciled) status, as determined by updatedb based
on knowledge of not only the plain fetch status and the old status but also possible redirects.
If I'm not mistaken, the fetcher currently overwrites the status immediately, even before
we get to updatedb, hence the problem.
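
To make the two-status idea concrete, here is a rough sketch in Java. Everything in it is
hypothetical: the class, the enums, the field names and the status mapping are illustrative
assumptions, not the actual nutchbase code or its constants, and redirect handling is left out.
The only point is that the fetcher writes a temporary status and updatedb alone writes the
final one.

```java
public class StatusReconcileSketch {

    // Hypothetical enums for illustration; these are not Nutch's real status constants.
    public enum FetchStatus { SUCCESS, MOVED, TEMP_MOVED, GONE, NOTFOUND, ACCESS_DENIED, EXCEPTION }
    public enum DbStatus { UNFETCHED, FETCHED, GONE, REDIR_TEMP, REDIR_PERM, RETRY }

    /** Hypothetical page row holding both statuses side by side. */
    public static final class Page {
        public DbStatus dbStatus = DbStatus.UNFETCHED; // final status, written only by updatedb
        public FetchStatus fetchStatus = null;         // temporary status, written only by fetcher
    }

    /** Fetcher side: record the outcome without touching the db status. */
    public static void recordFetch(Page page, FetchStatus outcome) {
        page.fetchStatus = outcome;
    }

    /** Updatedb side: derive the final status from the old status and the fetch status. */
    public static void reconcile(Page page) {
        if (page.fetchStatus == null) {
            return; // not fetched in this round, so the old db status is kept
        }
        switch (page.fetchStatus) {
            case SUCCESS:    page.dbStatus = DbStatus.FETCHED;    break;
            case MOVED:      page.dbStatus = DbStatus.REDIR_PERM; break;
            case TEMP_MOVED: page.dbStatus = DbStatus.REDIR_TEMP; break;
            case EXCEPTION:  page.dbStatus = DbStatus.RETRY;      break;
            case GONE:
            case NOTFOUND:
            case ACCESS_DENIED:
                page.dbStatus = DbStatus.GONE; // assumed mapping, chosen for this sketch
                break;
        }
        page.fetchStatus = null; // the temporary status has been consumed
    }

    public static void main(String[] args) {
        Page page = new Page();
        recordFetch(page, FetchStatus.MOVED);
        System.out.println("after fetch:    " + page.dbStatus);
        reconcile(page);
        System.out.println("after updatedb: " + page.dbStatus);
    }
}
```

With this split, a page that has been fetched but not yet reconciled still carries its old
db status, so it would not need to pass through an undefined state between the two jobs.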

> Fetcher generates entries with status 0
> ---------------------------------------
>
>                 Key: NUTCH-864
>                 URL: https://issues.apache.org/jira/browse/NUTCH-864
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>         Environment: Gora with SQLBackend
> URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase
> Last Changed Rev: 980748
> Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010)
>            Reporter: Julien Nioche
>            Assignee: Doğacan Güney
>             Fix For: 2.0
>
>
> After a round of fetching which got the following protocol status :
> 10/07/30 15:11:39 INFO mapred.JobClient:     ACCESS_DENIED=2
> 10/07/30 15:11:39 INFO mapred.JobClient:     SUCCESS=1177
> 10/07/30 15:11:39 INFO mapred.JobClient:     GONE=3
> 10/07/30 15:11:39 INFO mapred.JobClient:     TEMP_MOVED=138
> 10/07/30 15:11:39 INFO mapred.JobClient:     EXCEPTION=93
> 10/07/30 15:11:39 INFO mapred.JobClient:     MOVED=521
> 10/07/30 15:11:39 INFO mapred.JobClient:     NOTFOUND=62
> I ran : ./nutch org.apache.nutch.crawl.WebTableReader -stats
> 10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable: 
> 10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls:	2690
> 10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0:	2690
> 10/07/30 15:12:37 INFO crawl.WebTableReader: min score:	0.0
> 10/07/30 15:12:37 INFO crawl.WebTableReader: avg score:	0.7587361
> 10/07/30 15:12:37 INFO crawl.WebTableReader: max score:	1.0
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null):	649
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched):	1177 (SUCCESS=1177)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone):	112 
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry):	93 (EXCEPTION=93)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp):	138  (TEMP_MOVED=138)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm):	521 (MOVED=521)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done
> There should not be any entries with status 0 (null)
> I will investigate a bit more...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.