nutch-dev mailing list archives

From "Chris Anderson" <>
Subject Re: Crawler Data
Date Tue, 27 May 2008 14:17:17 GMT
I'm in a similar position. I'd like to be able to run arbitrary Hadoop
jobs across the pages saved by Nutch. This should be simple enough,
but I haven't found any direct documentation on how to do it yet.
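For what it's worth, one way this could look is a plain Hadoop job (classic `org.apache.hadoop.mapred` API, which is what Hadoop shipped at the time) pointed at the `content` directory that Nutch writes inside each segment, which holds `<Text url, Content>` records readable with `SequenceFileInputFormat`. This is an untested sketch under those assumptions; the class names `PageSizeJob`/`PageMapper` and the segment layout are mine, not something confirmed in this thread:

```java
// Hypothetical sketch: a Hadoop job mapping over the fetched-page records
// that Nutch stores under <segment>/content/. Assumes keys are Text URLs
// and values are org.apache.nutch.protocol.Content, read via
// SequenceFileInputFormat (which also handles MapFile directories).
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.nutch.protocol.Content;

public class PageSizeJob {

  public static class PageMapper extends MapReduceBase
      implements Mapper<Text, Content, Text, LongWritable> {
    public void map(Text url, Content content,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      // content.getContent() is the raw fetched page; emit its size per URL
      // as a stand-in for whatever arbitrary processing you actually want.
      out.collect(url, new LongWritable(content.getContent().length));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(PageSizeJob.class);
    job.setJobName("nutch-page-sizes");
    // args[0] = a segment directory, e.g. crawl/segments/20080527123456
    FileInputFormat.addInputPath(job, new Path(args[0], Content.DIR_NAME));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setMapperClass(PageMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    JobClient.runJob(job);
  }
}
```

Swapping `PageMapper` for your own mapper is then the general pattern for "arbitrary jobs over the crawl"; the only Nutch-specific parts are the `Content` value class and the segment directory layout.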

Thanks in advance for any pointers.


On Tue, May 27, 2008 at 4:46 AM, Jorge Conejero Jarque <> wrote:
> I would like to build an application with the Nutch API that extracts data from the
> pages before they are indexed, so that I can apply some kind of modification or
> processing to them; I think it could become something useful and interesting.
> The problem is that I cannot find information on how to use the Crawl part of the API.
> I only find exercises that are run from the console with Cygwin, and they only explain
> aspects of configuration and of creating an index.
> It would help me a lot if you could share some examples.
> Thanks.
> Best regards.
> Jorge Conejero Jarque
> Dpto. Java Technology Group
> GPM Factoría Internet
> 923100300
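If it helps, the same segment data the console commands operate on can be read directly from Java, which gives you a hook to inspect or transform pages before you build an index. The following is an untested sketch under the assumption that the `content` part directories inside a segment are MapFiles whose `data` files hold `<Text url, Content>` pairs; the class name `SegmentDump` and the `part-00000` path are illustrative only:

```java
// Hypothetical sketch: iterate over the fetched pages stored in a Nutch
// segment so they can be examined or modified before indexing.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class SegmentDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // args[0] = a segment directory; each MapFile part directory under
    // content/ contains a "data" SequenceFile of <Text, Content> records.
    Path data = new Path(args[0], Content.DIR_NAME + "/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    Content content = new Content();
    while (reader.next(url, content)) {
      // Here you could rewrite, filter, or enrich the page before indexing;
      // this sketch just prints a summary line per URL.
      System.out.println(url + "\t" + content.getContentType()
          + "\t" + content.getContent().length + " bytes");
    }
    reader.close();
  }
}
```

The console tools do essentially the same walk (e.g. dumping a segment's contents), so this is just the programmatic counterpart of what the Cygwin exercises show.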

Chris Anderson
