nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <>
Subject [jira] [Closed] (NUTCH-396) mergesegs sorts URLs, making segments useless for subsequent fetch
Date Fri, 01 Apr 2011 14:41:06 GMT


Markus Jelsma closed NUTCH-396.

    Resolution: Won't Fix

> mergesegs sorts URLs, making segments useless for subsequent fetch
> ------------------------------------------------------------------
>                 Key: NUTCH-396
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 0.8
>         Environment: Mac OS X 10.4.7
>            Reporter: Doug Cook
>            Priority: Minor
> Mergesegs leaves the output segment in URL-sorted order.
> This is a problem if the segment was just generated and not yet fetched - the fetcher
> likes the URLs to be in essentially random order (sort by URL hash or similar). If I fetch
> a segment created by mergesegs, my performance is extremely poor since all URLs from a given
> host will be grouped together and the per-host delays kill me.
> I have a local fix which I am using: map using a key of MD5(URL) + URL, then, during
> the reduce phase, chop the MD5 off the front to get the original URL. This is simple, has
> essentially random order, no problems with collisions, and seems to work nicely.
> The only thing I don't know is whether or not there is some other tool expecting the
> sorted order (I would expect not, since generate does not produce this). Right now I have
> my fix as an option (-randomize), but if there is no other tool requiring sorted order, it's
> probably cleaner to just make this non-optional.
> Thoughts?
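
For readers of the archive, the keying trick described in the issue can be sketched
in plain Java without Hadoop. This is not the reporter's patch and does not use any
Nutch or Hadoop API. The class name RandomizedSegmentOrder and its methods are
hypothetical, the hex encoding of the MD5 digest is an assumption (the issue does not
say how the hash is encoded), and a plain sort stands in for the MapReduce sort
between the map and reduce phases.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not part of Nutch.
public class RandomizedSegmentOrder {
    // An MD5 digest is 16 bytes, so its hex form is always 32 characters.
    // (Hex encoding is an assumption made for this sketch.)
    public static final int MD5_HEX_LENGTH = 32;

    // "Map" side: build the sort key MD5(URL) + URL.
    // Appending the full URL keeps keys distinct even if two hashes collide.
    public static String randomizedKey(String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder key = new StringBuilder(MD5_HEX_LENGTH + url.length());
            for (byte b : digest) {
                key.append(String.format("%02x", b));
            }
            return key.append(url).toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // "Reduce" side: chop the fixed-length hash prefix off to recover the URL.
    public static String originalUrl(String key) {
        return key.substring(MD5_HEX_LENGTH);
    }

    // Stand-in for the framework's sort: order by key, then strip the prefix.
    public static List<String> reorder(List<String> urls) {
        List<String> keys = new ArrayList<>();
        for (String url : urls) {
            keys.add(randomizedKey(url));
        }
        Collections.sort(keys);
        List<String> result = new ArrayList<>();
        for (String key : keys) {
            result.add(originalUrl(key));
        }
        return result;
    }

    public static void main(String[] args) {
        // URL-sorted input, as mergesegs is reported to produce.
        List<String> sortedUrls = Arrays.asList(
            "http://a.example/1", "http://a.example/2",
            "http://b.example/1", "http://b.example/2");
        for (String url : reorder(sortedUrls)) {
            System.out.println(url);
        }
    }
}
```

Because the order depends only on the hash of each URL, it is deterministic for a
given input, but URLs from one host are no longer guaranteed to sit next to each
other the way they do in URL-sorted order.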

This message is automatically generated by JIRA.