nutch-dev mailing list archives

From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Resolved: (NUTCH-503) Generator exits incorrectly for small fetchlists
Date Mon, 09 Jul 2007 06:48:04 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-503.
---------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 0.8.2)
                   1.0.0
         Assignee: Doğacan Güney

Committed in rev. 554539 with style changes. 

I skipped the unit test part, but we should consider bringing in the Hadoop test jar in the future
so that we can run test jobs in a distributed environment.

PS: Vishal, for future reference, Nutch uses 2-space indents.

> Generator exits incorrectly for small fetchlists 
> -------------------------------------------------
>
>                 Key: NUTCH-503
>                 URL: https://issues.apache.org/jira/browse/NUTCH-503
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 0.8, 0.8.1, 0.9.0
>         Environment: Fedora Core 2, JDK 1.6
>            Reporter: Vishal Shah
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: emptyfetchlist.patch, emptyfetchlist.patch
>
>
>    I think I found the reason why the generator returns with an empty fetchlist for small
>    fetchsizes.
>
>    After the first job finishes running, the generator checks the following condition
>    to see if it got an empty list:
>  
>     if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable())) {
>  
>   The third condition is incorrect. In some cases, especially for small fetchlists, the
>   first partition might be empty while other partitions contain URLs. The Generator then
>   incorrectly assumes that all partitions are empty after looking only at the first. The
>   same problem can occur when all URLs in the fetchlist come from the same host (or from
>   a very small number of hosts, or from hosts that all map to a small number of
>   partitions).
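
For illustration only, here is a minimal sketch of the difference between the reported check and one that looks at every partition. It is not the patch committed in rev. 554539. `PartitionReader`, `FetchlistCheck`, `partitionWith` and both method names are hypothetical stand-ins, so that the example runs without Hadoop; in the Generator the readers would be Hadoop sequence file readers whose `next(...)` takes a key such as `FloatWritable`.

```java
// Hypothetical stand-in for a per-partition reader (not a Hadoop class):
// next() returns true if another entry was read, false once the partition is exhausted.
interface PartitionReader {
  boolean next();
}

class FetchlistCheck {

  // The check as reported: only the first partition is inspected.
  static boolean isEmptyFirstOnly(PartitionReader[] readers) {
    return readers == null || readers.length == 0 || !readers[0].next();
  }

  // Sketch of a fix: the fetchlist is empty only if no partition yields an entry.
  static boolean isEmptyAllPartitions(PartitionReader[] readers) {
    if (readers == null || readers.length == 0) {
      return true;
    }
    for (PartitionReader reader : readers) {
      if (reader.next()) {
        return false; // at least one partition contains a URL
      }
    }
    return true;
  }

  // Helper for the demo: an in-memory partition holding the given number of entries.
  static PartitionReader partitionWith(int entries) {
    int[] remaining = {entries};
    return () -> remaining[0]-- > 0;
  }

  public static void main(String[] args) {
    // Partition 0 is empty, partition 1 holds three URLs.
    PartitionReader[] first = {partitionWith(0), partitionWith(3)};
    PartitionReader[] all = {partitionWith(0), partitionWith(3)};
    System.out.println("first-only check says empty: " + isEmptyFirstOnly(first));
    System.out.println("all-partitions check says empty: " + isEmptyAllPartitions(all));
  }
}
```

With an empty first partition and a non-empty second one, the first-only check reports an empty fetchlist while the all-partitions check does not. Note that a check like this consumes one entry from a partition, so a real implementation would presumably reopen or discard the readers afterwards.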

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

