nutch-dev mailing list archives

From "Talat UYARER (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-2003) topN is not work correctly
Date Wed, 29 Apr 2015 09:33:05 GMT
Talat UYARER created NUTCH-2003:
-----------------------------------

             Summary: topN is not work correctly
                 Key: NUTCH-2003
                 URL: https://issues.apache.org/jira/browse/NUTCH-2003
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 2.3
            Reporter: Talat UYARER
            Priority: Minor


I want to crawl the top 1000 URLs, ordered by score, from the webpage table. This does not
work correctly.

When I use the topN parameter, it is divided by the number of map tasks (topN / maptaskcounts
= maptasktopN), and every map task generates only maptasktopN URLs. For example, with 25 map
tasks and topN set to 1000, maptasktopN is calculated as 40. As a result, we do not get the
1000 highest-scored URLs overall; we get 1000 URLs made up of the 40 highest-scored URLs from
each of the 25 map tasks.
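
The difference can be illustrated with a small standalone simulation. This is only a sketch
of the arithmetic described above, not Nutch code: the function names, the synthetic URLs,
and the score distribution (earlier partitions hold higher scores) are all assumptions made
for illustration.

```python
import heapq


def per_task_top_n(partitions, top_n):
    """Mimic the reported behaviour: each task keeps only top_n / task_count entries."""
    limit = top_n // len(partitions)  # e.g. 1000 // 25 = 40
    selected = []
    for part in partitions:
        selected.extend(heapq.nlargest(limit, part, key=lambda entry: entry[1]))
    return selected


def global_top_n(partitions, top_n):
    """What the reporter expects: the top_n highest scores across all entries."""
    all_entries = [entry for part in partitions for entry in part]
    return heapq.nlargest(top_n, all_entries, key=lambda entry: entry[1])


def make_partitions(num_tasks=25, urls_per_task=100):
    """Hypothetical skewed data: partition 0 holds the highest scores, partition 24 the lowest."""
    return [
        [(f"url-{p}-{j}", (num_tasks - p) * 1000 + j) for j in range(urls_per_task)]
        for p in range(num_tasks)
    ]


partitions = make_partitions()
per_task = per_task_top_n(partitions, 1000)
best = global_top_n(partitions, 1000)

# Count how many of the truly highest-scored URLs the per-task split actually selected.
overlap = len({url for url, _ in per_task} & {url for url, _ in best})
print(len(per_task), len(best), overlap)
```

With this synthetic data both selections return 1000 URLs, but only 400 of them coincide: the
per-task split takes 40 URLs from each of the 10 partitions that make up the true top 1000
and fills the rest with lower-scored URLs from the other 15 partitions.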




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
