nutch-dev mailing list archives

From "Drew Hite" <>
Subject Re: problem with URLS/nutch
Date Mon, 23 Jun 2008 20:34:44 GMT
My understanding is that Nutch is designed to use Hadoop to run in a distributed
fashion across many machines.  In order to scale across those machines, Nutch
needs to accept its inputs through shared storage that all nodes in the cluster
can read (in practice, this means files on the Hadoop file system).  That's
what is going on with the url directory step--the files in that directory are
read, and the URLs in those files are injected into the Nutch crawl database.
Since Nutch and Hadoop are tightly coupled, I don't believe there is a way to
invoke a Nutch crawl using a more traditional input parameter like a String or
a Map.  I think the best you can do is programmatically generate a URL seed
file and then invoke the crawl.  Please correct me if I'm wrong, but I don't
think there's any getting away from using files as input and output parameters
to Hadoop jobs.
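A minimal sketch of that approach in Java, assuming Nutch's standard seed
format of one URL per line (the directory and file names here are arbitrary,
and the code uses the modern java.nio API for brevity):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class SeedWriter {
    // Write a list of URLs, one per line, into a seed file inside the given
    // directory -- the same layout the "urls" directory passed to
    // "bin/nutch crawl" expects.
    public static Path writeSeeds(Path urlDir, List<String> urls) throws IOException {
        Files.createDirectories(urlDir);
        Path seedFile = urlDir.resolve("seed.txt");
        Files.write(seedFile, urls);
        return seedFile;
    }

    public static void main(String[] args) throws IOException {
        Path seedFile = writeSeeds(Paths.get("urls"),
                List.of("http://example.com/", "http://example.org/"));
        System.out.println("Wrote " + Files.readAllLines(seedFile).size()
                + " seed URLs to " + seedFile);
    }
}
```

Once the seed directory exists, the crawl can be kicked off the usual way,
e.g. bin/nutch crawl urls -dir crawl -depth 3 -topN 50.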


On Mon, Jun 23, 2008 at 4:22 PM, All day coders <> wrote:

> Well, if you want to add URLs using the Nutch API, then you should trace the
> program until you find the point where the directory containing the list of
> URLs is read to load them.
> On Mon, Jun 23, 2008 at 5:27 AM, yogesh somvanshi <
>> wrote:
>> Hello all,
>> I am working on Nutch.
>> When you use the standard crawl command, like: bin/nutch crawl urls -dir crawl
>> -depth 3 -topN 50
>> crawling works well, but I want to remove the need for that urls folder.
>> I want to replace the urls folder with some Array or Map, but when I try to
>> change the code, I see it relies on Hadoop... and it is too hard to make
>> that change. Is there any other option for that?
>> Yogi
