nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-331) Fetcher incorrectly reports task progress to tasktracker resulting in skipped URLs
Date Thu, 23 Nov 2006 10:56:04 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-331?page=comments#action_12452199 ] 
            
Andrzej Bialecki  commented on NUTCH-331:
-----------------------------------------

My analysis of the problem was incorrect; in fact, it was most likely caused by problems
in Generator. I haven't experienced this problem since the Generator issue was fixed, so
I'm closing this issue for now.

> Fetcher incorrectly reports task progress to tasktracker resulting in skipped URLs
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-331
>                 URL: http://issues.apache.org/jira/browse/NUTCH-331
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8, 0.9.0
>            Reporter: Andrzej Bialecki 
>            Priority: Critical
>             Fix For: 0.9.0
>
>
> Each Fetcher task starts multiple FetcherThreads, which consume the input fetchlist.
> These threads may block for a long time after being started and after reading their input
> fetchlist entries, due to "politeness" settings. However, the map-reduce framework considers
> the task complete once all input data has been read.
>
> This causes the tasktracker to incorrectly assume that task processing is complete (because
> the task progress is 1.0, since all input has been consumed), whereas many URLs from the fetchlist
> may still be waiting to be fetched in blocked threads. The more threads are used, the more apparent
> the problem becomes, because the final number of fetched pages may fall short of the target number
> by as many as (numThreads * numMapTasks) entries.
>
> The final result is that only part of the fetchlist is fetched, because Fetcher
> map tasks are stopped when their progress reaches 1.0.
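
The (numThreads * numMapTasks) bound in the quoted description can be illustrated with a small sketch. It assumes, as the description hypothesizes, that each blocked thread holds at most one fetchlist entry it has read but not yet fetched when its task is stopped. The class and method names are hypothetical and are not part of Nutch, and the comment above later attributes the observed behavior to Generator instead.

```java
public class FetchShortfall {

    /**
     * Hypothetical helper (not a Nutch API): worst-case number of fetchlist
     * entries left unfetched, assuming every thread in every map task holds
     * exactly one entry that was read but not yet fetched when the task stops.
     */
    static long maxUnfetchedEntries(int numThreads, int numMapTasks) {
        if (numThreads < 0 || numMapTasks < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        // Widen to long before multiplying to avoid int overflow.
        return (long) numThreads * numMapTasks;
    }

    public static void main(String[] args) {
        // Example values chosen for illustration only.
        int numThreads = 10;
        int numMapTasks = 4;
        System.out.println("Worst-case unfetched entries: "
                + maxUnfetchedEntries(numThreads, numMapTasks));
    }
}
```

Under that assumption, the shortfall grows linearly with the thread count per task, which matches the description's remark that more threads make the problem more apparent.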

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
