nutch-dev mailing list archives

From Sujen Shah <sujen1...@gmail.com>
Subject Generate separate fetchlist by host
Date Mon, 15 Jun 2015 17:50:23 GMT
Hi Everyone,

I want to know if it is possible to generate multiple fetchlists from the
generator by 'Host' or any other user-specified criteria (like a regex).
If a single large fetchlist is generated, the fetcher runs for too long.
It would be nice if the URLs could be split into separate fetchlists
according to some criteria. That would make large crawls easier to analyze,
without having to wait for the entire fetch job to finish.
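
As an illustration of the kind of split described above, here is a minimal
sketch in Python. It is not Nutch code and does not use any Nutch API: the
helper name group_by_host and the example URLs are hypothetical, and it only
shows the idea of partitioning a URL list into one fetchlist per host.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Partition URLs into one list per host.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    groups = defaultdict(list)
    for url in urls:
        # hostname is None for URLs without a network location,
        # so fall back to an empty-string bucket in that case.
        host = urlsplit(url).hostname or ""
        groups[host].append(url)
    return dict(groups)


# Made-up example URLs (assumption, not taken from a real crawl).
example_urls = [
    "http://a.example/page1",
    "http://b.example/index",
    "http://a.example/page2",
]
fetchlists = group_by_host(example_urls)
for host, host_urls in fetchlists.items():
    print(host, len(host_urls))
```

Each per-host list could then be fetched and analyzed on its own. A regex or
other user-specified key could replace the host lookup in the same way.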

I was reading the documentation at
http://wiki.apache.org/nutch/bin/nutch%20generate
The numFetchers and maxNumSegments options do mention generating
multiple fetch partitions and segments, and the generate.max.count and
generate.count.mode properties allow some configuration.

But I could not tell whether it is possible to generate multiple fetchlists
this way (I am currently working in local mode).

Thank you.

Regards,
Sujen Shah
M.S - Computer Science (Class of 2016)
University of Southern California
http://www.linkedin.com/in/sujenshah
