nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1687) Pick queue in Round Robin
Date Sat, 05 Apr 2014 08:31:16 GMT


Julien Nioche commented on NUTCH-1687:

I like the idea but am a bit concerned by the potential impact of calling:

it = Iterables.cycle(queues.keySet()).iterator();

whenever a new FetchItemQueue is added. It will be called a lot at the beginning of a fetch,
when we create most of the queues, and we'd create loads of iterators that would be overwritten
straight away.

What about doing this lazily and triggering the generation of a new iterator only if getFetchItem()
is called and at least one FetchItemQueue has been added since the iterator was last created?
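
To make the lazy idea concrete, here is a minimal sketch, not the actual patch. The class and its members (LazyRoundRobinQueues, addItem, getItem, rebuilds) are hypothetical stand-ins for the fetcher's queue handling, and it uses a plain key snapshot plus an index instead of Guava's Iterables.cycle() so that it stays self-contained. Adding a queue only sets a flag; the rotation is rebuilt on the next call that picks an item.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the fetcher's queue collection; not Nutch code.
class LazyRoundRobinQueues {
    // Queue id -> pending URLs, kept in insertion order.
    private final Map<String, List<String>> queues = new LinkedHashMap<>();
    private List<String> keys = new ArrayList<>(); // snapshot used for the rotation
    private int next = 0;                          // position in the rotation
    private boolean dirty = false;                 // true once a queue has been added
    private int rebuilds = 0;                      // counts rebuilds, for illustration only

    synchronized void addItem(String queueId, String url) {
        List<String> q = queues.get(queueId);
        if (q == null) {
            q = new ArrayList<>();
            queues.put(queueId, q);
            dirty = true; // only flag the change; nothing is rebuilt here
        }
        q.add(url);
    }

    synchronized String getItem() {
        if (dirty) {
            // Rebuild once, however many queues were added since the last call.
            // The rotation restarts at the first queue after a rebuild.
            keys = new ArrayList<>(queues.keySet());
            next = 0;
            dirty = false;
            rebuilds++;
        }
        // Visit each queue at most once, resuming where the last call stopped.
        for (int i = 0; i < keys.size(); i++) {
            List<String> q = queues.get(keys.get(next));
            next = (next + 1) % keys.size();
            if (!q.isEmpty()) {
                return q.remove(0);
            }
        }
        return null; // every queue is empty
    }

    synchronized int rebuilds() {
        return rebuilds;
    }

    public static void main(String[] args) {
        LazyRoundRobinQueues q = new LazyRoundRobinQueues();
        q.addItem("a", "a1");
        q.addItem("a", "a2");
        q.addItem("b", "b1");
        q.addItem("c", "c1");
        // Three queues were added, but the rotation is built only once, on first use.
        System.out.println(q.getItem() + " " + q.getItem() + " " + q.getItem() + " " + q.getItem());
        System.out.println("rebuilds = " + q.rebuilds());
    }
}
```

The sketch leaves empty queues in place and ignores politeness delays, both of which a real fetcher has to handle; it is only meant to show where the rebuild cost moves to.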

I agree that in the middle of a fetch, queues don't get added so often compared to calls to
getFetchItem(), so not having to create an iterator there as we currently do would definitely
be a plus.

In extreme cases, when there is a large diversity of hostnames / domains within a fetchlist,
we could end up creating a new iterator for every new URL and would always start at the first
queue. That is what we currently do anyway, so the new approach would be no worse.

What do you think?

Also, why not use Iterators.cycle() directly?


> Pick queue in Round Robin
> -------------------------
>                 Key: NUTCH-1687
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Tien Nguyen Manh
>            Priority: Minor
>             Fix For: 1.9
>         Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch
> Currently we choose the queue to pick a URL from by starting at the head of the queue list,
> so queues at the start of the list have more chance of being picked first. This can cause a
> long-tail problem, where only a few queues, each holding many URLs, are left at the end.
> public synchronized FetchItem getFetchItem() {
>       final Iterator<Map.Entry<String, FetchItemQueue>> it =
>         queues.entrySet().iterator(); // ==> always reset to find queue from start
>       while (it.hasNext()) {
> ....
> I think it is better to pick queues in round robin. That can reduce the time needed to find
> an available queue and ensures all queues are picked in turn, and if we use TopN in the
> generator there is no long tail of queues at the end.

