nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay
Date Fri, 08 Dec 2017 16:06:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16283767#comment-16283767 ]

ASF GitHub Bot commented on NUTCH-2455:
---------------------------------------

okedoki opened a new pull request #254: fix for NUTCH-2455 more efficient usage of hostdb
in generate
URL: https://github.com/apache/nutch/pull/254
 
 
   Three questions/modifications are left open:
   1) In several places the Nutch code uses url.getHost(), in others it uses url.getHost().toLowerCase().
Why?
   2) public static class ScoreHostKeyComparator extends WritableComparator should implement
a raw comparator. If you know how to do it, you are welcome to.
   3) The whole Generator file is too big; it should be split into several files. Again, if
you know how to do this in a good way, you are welcome to.
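
To illustrate what point 2 is asking for, here is a minimal standalone sketch of a byte-level comparison. It does not depend on Hadoop; the compare method only mirrors the parameter shape of Hadoop's RawComparator#compare(byte[], int, int, byte[], int, int). The class name, the serialize helper and the wire layout (host length, host bytes, float score) are assumptions made for the example, not the actual format used by the Generator.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Standalone sketch: no Hadoop dependency; layout and names are assumptions.
class RawHostScoreCompare {

  /** Assumed wire layout: [int hostLen][host as UTF-8][float score]. */
  static byte[] serialize(String host, float score) {
    byte[] h = host.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(4 + h.length + 4)
        .putInt(h.length).put(h).putFloat(score).array();
  }

  /**
   * Compares two serialized keys in place: host bytes in unsigned
   * lexicographic order, then score descending. l1 and l2 are kept only
   * to match the raw-comparator parameter shape.
   */
  static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    int len1 = ByteBuffer.wrap(b1).getInt(s1);
    int len2 = ByteBuffer.wrap(b2).getInt(s2);
    // Compare host bytes directly, without building String objects.
    int n = Math.min(len1, len2);
    for (int i = 0; i < n; i++) {
      int x = b1[s1 + 4 + i] & 0xff;
      int y = b2[s2 + 4 + i] & 0xff;
      if (x != y) {
        return x < y ? -1 : 1;
      }
    }
    if (len1 != len2) {
      return len1 < len2 ? -1 : 1;
    }
    float score1 = ByteBuffer.wrap(b1).getFloat(s1 + 4 + len1);
    float score2 = ByteBuffer.wrap(b2).getFloat(s2 + 4 + len2);
    return Float.compare(score2, score1); // higher score sorts first
  }
}
```

The point of comparing raw bytes is that the sort phase can order keys without deserializing each one into an object first.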

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Speed up the merging of HostDb entries for variable fetch delay
> ---------------------------------------------------------------
>
>                 Key: NUTCH-2455
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2455
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>         Attachments: NUTCH-2455.patch
>
>
> Citing Sebastian at NUTCH-2420:
> "The correct solution would be to use <host,score> pairs as keys in the Selector
> job, with a partitioner and secondary sorting so that all keys with same host end up in the
> same call of the reducer. If values can also hold a HostDb entry and the sort comparator guarantees
> that the HostDb entry (entries if partitioned by domain or IP) comes in front of all CrawlDb
> entries. But that would be a substantial improvement..."
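
The secondary-sort idea cited above can be sketched in plain Java without any Hadoop classes. This is only a simulation under stated assumptions: HostScoreKey, its hostDbEntry flag, partition and reducerOrder are hypothetical names for the example, not code from the attached patch or the pull request.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Standalone simulation of the secondary-sort idea; all names are hypothetical.
class SecondarySortSketch {

  /** Composite key: host plus score; hostDbEntry marks the record from the HostDb. */
  static final class HostScoreKey {
    final String host;
    final float score;
    final boolean hostDbEntry;

    HostScoreKey(String host, float score, boolean hostDbEntry) {
      this.host = host;
      this.score = score;
      this.hostDbEntry = hostDbEntry;
    }
  }

  /** Partition on the host alone, so every key of a host reaches the same reducer. */
  static int partition(HostScoreKey key, int numReducers) {
    return (key.host.hashCode() & Integer.MAX_VALUE) % numReducers;
  }

  /** Sort order: host, then the HostDb entry first, then score descending. */
  static final Comparator<HostScoreKey> SORT =
      Comparator.comparing((HostScoreKey k) -> k.host)
          .thenComparingInt(k -> k.hostDbEntry ? 0 : 1)
          .thenComparing((a, b) -> Float.compare(b.score, a.score));

  /** Returns the keys in the order a single reducer would see them. */
  static List<HostScoreKey> reducerOrder(List<HostScoreKey> keys) {
    List<HostScoreKey> sorted = new ArrayList<>(keys);
    sorted.sort(SORT);
    return sorted;
  }
}
```

In a real job, a grouping comparator on the host alone would additionally decide where one reducer call ends and the next begins. With this ordering the reducer reads the HostDb entry before any CrawlDb entry of the same host.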



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
