nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1630) How to achieve finishing fetch approximately at the same time for each queue (a.k.a adaptive queue size)
Date Tue, 10 Sep 2013 22:29:52 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763628#comment-13763628 ]

Markus Jelsma commented on NUTCH-1630:
--------------------------------------

Interesting, can you provide a patch for trunk as well?
                
> How to achieve finishing fetch approximately at the same time for each queue (a.k.a adaptive queue size)
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1630
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1630
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.1, 2.2, 2.2.1
>            Reporter: Talat UYARER
>              Labels: improvement
>             Fix For: 2.3
>
>         Attachments: NUTCH-1630.patch
>
>
> Problem Definition:
> When crawling, queue sizes are disproportionate, so once the shorter queues have finished, the fetcher still has to wait a long time for the long-running ones. In practice you may have to wait a couple of days for some queues.
> Normally we set the maximum queue size with generate.max.count, but that is a static value, whereas the number of URLs to be fetched increases with each depth. Giving all queues the same length does not mean that all queues will finish around the same time. Other users have raised this problem before [1], so we came up with a different approach.
> Solution:
> Nutch has three modes for creating fetch queues (byHost, byDomain, byIP). Our solution applies to all three modes.
> 1- Define a "fetch workload of current queue" (FW) value for each queue, based on the previous fetches of that queue. We calculate this by:
>     FW = average response time of previous depth * number of URLs in current queue
> 2- Calculate the harmonic mean [2] of all FWs to get the average workload of the current depth (AW).
> 3- Get the length of each queue by dividing AW by the previously known average response time of that queue:
>     Queue Length = AW / average response time
> Using this algorithm leads to a fetch phase in which all queues finish at around the same time.
> I will send my patch as soon as possible. Do you have any comments?
> [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html
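> [2] In our opinion, the harmonic mean is the best fit for our case because our data has a few points that are much higher than the rest, and the harmonic mean is pulled up by such outliers far less than the arithmetic mean.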
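
The three steps in the quoted description can be sketched roughly as below. This is only an illustration and not the attached NUTCH-1630.patch: the function name, the dictionary input, the rounding, and the sample numbers are all assumptions, and Python is used for brevity even though Nutch itself is written in Java.

```python
from statistics import harmonic_mean


def adaptive_queue_lengths(queues):
    """Return a suggested length per queue (hypothetical helper).

    `queues` maps a queue id (host, domain or IP) to a tuple of
    (average response time at the previous depth, in seconds,
     number of URLs currently in the queue). Both must be positive.
    """
    for queue_id, (avg_response, url_count) in queues.items():
        if avg_response <= 0 or url_count <= 0:
            raise ValueError(f"non-positive input for queue {queue_id!r}")

    # Step 1: FW = average response time of previous depth * number of URLs.
    workloads = [avg * count for avg, count in queues.values()]

    # Step 2: AW = harmonic mean of all FW values.
    average_workload = harmonic_mean(workloads)

    # Step 3: queue length = AW / average response time of that queue.
    # Rounding down is an assumption; the description does not specify it.
    return {
        queue_id: int(average_workload / avg)
        for queue_id, (avg, _count) in queues.items()
    }


# Hypothetical sample data: (seconds per response, URLs in queue).
sample = {
    "fast.example": (0.5, 1000),
    "medium.example": (1.0, 500),
    "slow.example": (2.0, 100),
}
lengths = adaptive_queue_lengths(sample)
```

With these sample numbers, multiplying each resulting length by the queue's average response time gives roughly 333 seconds for every queue, which is the balancing effect the description aims for. A real implementation would presumably also cap each length at the number of URLs actually available for that queue.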

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
