nutch-dev mailing list archives

From "Otis Gospodnetic (JIRA)" <>
Subject [jira] Updated: (NUTCH-570) Improvement of URL Ordering in
Date Wed, 21 May 2008 21:49:55 GMT


Otis Gospodnetic updated NUTCH-570:

    Assignee: Otis Gospodnetic

Another nudge for feedback from Ned or anyone else who has tried this.
I've been using this patch without any problems, though I have not verified that it works
as advertised or that it really orders URLs more optimally.


> Improvement of URL Ordering in
> ---------------------------------------------
>                 Key: NUTCH-570
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>            Reporter: Ned Rockson
>            Assignee: Otis Gospodnetic
>            Priority: Minor
>         Attachments: GeneratorDiff.out
> [Copied directly from my email to nutch-dev list]
> Recently I switched to Fetcher2 over Fetcher for larger whole-web fetches (50-100M at
> a time).  I found that the generated URLs are not ordered optimally because they are
> simply randomized by a hash comparator.  In one crawl on 24 machines it took about 3 days
> to crawl 30M URLs.  Compared with old benchmarks I had set with the regular Fetcher, this
> took at least 3 times longer.
> Anyway, I realized that randomization only approximates the best ordering: to get an
> optimal ordering, URLs from the same host should be as far apart in the list as possible.
> So I wrote a series of 2 map/reduce jobs to optimize the ordering; for a list of 25M
> documents it takes about 10 minutes on our cluster.  Right now I have it in its own
> class, but I figured it can go in and just add a flag in nutch-default.xml that
> determines whether the user wants to use it.
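
For anyone who wants to reason about the ordering goal in the quoted description (URLs from
the same host as far apart in the list as possible), here is a minimal single-process sketch
in Python. It is not the code in GeneratorDiff.out and it does not use map/reduce; the
function name spread_by_host and the fractional-position scheme are assumptions made only
for illustration. The first pass groups URLs by host, and the second pass assigns each URL
an evenly spaced position within [0, 1) and sorts by it.

```python
from collections import defaultdict
from urllib.parse import urlsplit
import zlib


def spread_by_host(urls):
    """Order urls so that entries from the same host are spaced evenly.

    Hypothetical illustration only; not the NUTCH-570 patch.
    """
    # Pass 1: group URLs by host (a stand-in for a per-host grouping job).
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).hostname or ""].append(url)

    # Pass 2: the i-th URL of a host with n URLs gets the fractional
    # position (i + offset) / n; sorting by that position interleaves hosts.
    keyed = []
    for host, group in by_host.items():
        n = len(group)
        # Deterministic per-host offset in [0, 1) so hosts do not all start at 0.
        offset = (zlib.crc32(host.encode("utf-8")) % 1000) / 1000.0
        for i, url in enumerate(group):
            keyed.append(((i + offset) / n, url))

    keyed.sort()  # ties on position fall back to comparing the URL strings
    return [url for _, url in keyed]


if __name__ == "__main__":
    sample = [f"http://{h}.example/page{i}" for h in "abc" for i in range(4)]
    for url in spread_by_host(sample):
        print(url)
```

With this scheme a host that contributes n URLs has them spread roughly 1/n of the list
apart, so large hosts are spaced evenly rather than clustered by chance. When host sizes
are very uneven, URLs from the largest host can still end up next to each other.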

